Preface



Automatic facial analysis is one of the most important problems in 
computer vision and applications of video analysis. Face recognition, 
though being only one part of facial analysis, has applications in areas 
such as access control, face databases, face ID card, human computer 
interaction, law enforcement, multimedia management, security, smart 
cards and surveillance. 

The material in this book is the result of about 10 years of the authors’
research on face detection, facial feature detection, face recognition, face
tracking, emotion recognition and 3-D face modelling. What is unique
about it is the strong information-theoretic foundation upon which the
underlying algorithms are developed. Because of this foundation,
the face detection algorithm using the “Information-Based Maximum
Discrimination” criterion remains one of the fastest and most robust face
detection algorithms. In this book we give a detailed description of these
algorithms.

We also present a real time system for facial analysis for human- 
computer interface. Its characteristics are summarized as follows. First, 
the system is capable of detecting the presence of people in natural 
scenes with complex backgrounds. Then, it tracks the faces and their 
features in real time to produce information about the position, facial 
activity, and history of presence. Next, it analyzes the absolute position 
of the face in order to detect gestures such as shaking and nodding the 
head and moving the face forward or away from the camera. Finally, 
it recognizes faces and facial expressions based on a novel probabilistic 
framework in which faces are modelled by their appearances and their 
facial expressions. 

The book is designed for sharing our research results with graduate
students and researchers in facial analysis and computer vision. We look
forward to inquiries, collaborations, criticisms and suggestions from them.

Chapter 1



INTRODUCTION 



The human face is a powerful means of communication. We use it
constantly, naturally and effortlessly, not only to tell one person from
another, but also to transmit information about our feelings. As a pattern,
the face is a real challenge. Anatomically, all faces are similar in features
and structure, yet each of us looks very different from everyone else. Similarly,
facial expression patterns are fairly standard among all races, yet we
can identify individuals by their expressions. Moreover, facial
expressions and overall body position and motion are extremely complex
spatio-temporal patterns capable of transmitting much information
about a person’s state of mind and feelings.

Vision-based pattern recognition algorithms for the continuous analysis
of face-related video have to deal with the complexity in appearance
and expression of the facial patterns, and whenever possible, the algorithms
must use all available knowledge for this analysis. The overall
goal and motivation of this book is to provide computers with the vision-based
ability to detect, track and identify faces, and to recognize facial expressions
and gestures useful in human-computer interface.

First of all, the computer should be able to detect the presence of 
faces and accurately track their location in complex environments such 
as exist in real applications. This task itself represents an interesting, 
challenging problem in computer vision, but it is just the starting point 
of all other face analysis algorithms. These algorithms are particularly 
helpful to interactive systems, because they provide the machine with 
the sense of awareness of the presence of users, thereby triggering alarms, 
starting applications, initializing systems, etc. 

The machine must also analyze nonrigid facial deformations. These
are useful not only to recognize facial expressions or emotion in general,
but also to detect facial gestures. The analysis of nonrigid facial motion
also finds a number of applications in other fields such as model-based 
video coding, video games and animation. Additionally, it is poten- 
tially useful in collaboration with other modalities in human-computer 
interface such as speech recognition, in which it is used to improve the 
robustness of the recognition against noise in the audio signal. 

Finally, all the facial information collected by this continuous pro- 
cessing of image sequences should be used for facial recognition based 
not only on the facial appearance as in a still picture recognition, but 
also on the facial expression patterns. The face model used for this pur- 
pose should capture information about faces and facial expressions from 
training videos. 

1. Facial Analysis for Human-Computer Interface 

This book presents computer vision algorithms and a real time system 
for facial analysis for human-computer interface. An overview of the 
system is shown in Figure 1.1. This vision system takes continuous 
video from a camera aimed at the user's face, and produces a series of 
events that provide the machine with information about the user. 

First of all, the system is capable of detecting the presence of people 
in natural scenes, i.e., in complex backgrounds. Then, the faces and 
their features are tracked in real time to produce information about the 
position, facial activity, and history of presence. For the purpose of very 
fast face and facial feature detection, we have developed a learning tech- 
nique based on information-theoretic discrimination measures. Faces 
and their features are detected in a maximum likelihood setup based on 
classifiers trained from examples. These classifiers have small compu- 
tational requirements, allowing us to implement a face tracking system 
that is fully automatic and performs in real time. In addition to locking 
onto the target faces, this system locates and tracks their nonrigid facial 
feature positions as they move under facial expression changes. 

We have also developed and implemented in real time an algorithm 
that analyzes the absolute position of the face in order to detect gestures 
such as shaking and nodding the head and moving the face forward or 
away from the camera. This algorithm also detects periods of high and
low user activity. It is useful not only for sensing explicit
events such as yes- and no-like facial gestures, but also for obtaining
many other clues to the user’s state of mind. The algorithm could potentially
improve machine interfaces by adapting their behavior in response
to user conduct.

Next, we propose an algorithm for the embedded recognition of faces
and facial expressions based on a novel probabilistic framework in which
faces are modelled by their appearances and their facial expressions. For
a given image sequence, the algorithm finds the model and facial expression
that maximize the likelihood. In this framework, face
recognition is enhanced by facial expression modelling. Also, changes
in the facial features due to expressions are used together with facial
deformation patterns to perform expression recognition.

Figure 1.1. Facial analysis for intelligent human-computer interface: an overview

Finally, we describe the application of the face and facial feature detec- 
tion algorithms in a 3D Model-based image coding and communication 
system. We also present its application in a joint audio-visual person 
recognition system. 

Chapter 2 presents the aforementioned learning technique: information-based
maximum discrimination. Chapters 3 and 4 describe the
systems for face and facial feature detection and tracking, respectively.
The algorithms for detecting facial gestures from the absolute position
of the faces are also described in Chapter 4. In Chapter 5, we present the
probabilistic framework for the embedded recognition of faces and facial 
expressions. Chapter 6 presents a model based video analysis paradigm 
that uses the face and facial feature detection algorithms. Chapter 7 
explains in detail the different experiments and data sets used to evalu- 
ate the proposed techniques as well as the results of these performance 
evaluations. Chapter 8 describes the application of the algorithms in 
the task of audio-visual person recognition. Chapter 9 summarizes and 
provides future directions and concluding remarks. 




Chapter 2 



INFORMATION-BASED MAXIMUM 
DISCRIMINATION 



The task involved in pattern recognition is to make decisions about 
the unknown nature of observations. In this context, learning refers 
to the task of determining the nature of these observations such that 
pattern recognition can be achieved from them. 

An observation is a collection of measurements, or more formally, a
d-dimensional vector $x$ in some space $S^d$. The unknown nature of the
observations, the class $Y$, takes values in a finite set $\{C_1, C_2, \ldots, C_N\}$ used to label
the underlying process that generates these observations. A classifier is
the mapping function $g(x) : S^d \to \{C_1, C_2, \ldots, C_N\}$, and the question of
how to find the best classifier, namely learning, is fundamentally solved
by minimizing the probability of classification errors $P\{g(X) \neq Y\}$.

In practice, the probability of classification errors cannot be com- 
puted. Other quantities are used as the optimization criteria for learn- 
ing. In this chapter, we present a novel learning technique based on 
information-theoretic divergence measures, namely, information-based 
maximum discrimination, the most outstanding characteristic of which 
is its exceptionally low computational requirements in both its training 
and testing procedures. 

Section 1 briefly introduces the Bayes classifier and sets up the math- 
ematical nomenclature. In Section 2, we describe in detail the proposed 
learning method. Last, in Section 3, we discuss issues regarding the 
speed and computational requirements of this learning technique and its 
classifiers. 

1. The Bayes Classifier 

Consider the random pair $(X, Y) \in S^d \times \{C_1, C_2, \ldots, C_N\}$ with some
probability distribution. Classification errors occur when $g(X) \neq Y$.
The best classifier $g^*$, known as the Bayes classifier, is the one that minimizes
the probability of classification error. More formally, the Bayes
classifier satisfies

$$g^* = \arg\min_{g : S^d \to \{C_1, C_2, \ldots, C_N\}} P\{g(X) \neq Y\}, \tag{2.1}$$

and the probability of error $L^* = L(g^*) = P\{g^*(X) \neq Y\}$, known as the
Bayes error, is a measure of the difficulty of classification for the given
distribution.

If the distribution of $(X, Y)$ is known, the Bayes classifier can be computed.
In most practical cases, if not all, such distributions are unknown,
so one uses a collection of sample pairs $\{(X_i, Y_i),\ i = 1, 2, \ldots, n\}$, referred
to as the training set, in order to construct a sub-optimal classifier
$g_n(X \mid X_1, Y_1, \ldots, X_n, Y_n)$. The process of constructing such a classifier
given the training data is known as supervised learning, learning by examples,
or simply learning.

An important assumption in learning by examples is that the
training examples $(X_i, Y_i)$ are representative of all observations, or more
formally, that the $(X_i, Y_i)$ are independent identically distributed (i.i.d.)
random pairs with the same distribution as that of $(X, Y)$.



1.1 Two-class discrimination problem 

In a two-class discrimination problem, $Y \in \{0, 1\}$, the Bayes classifier
and Bayes error are often expressed in terms of the a posteriori
probability $P(Y \mid X)$ as

$$g^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x) \\ 0 & \text{otherwise,} \end{cases} \tag{2.2}$$

and

$$L^* = L(g^*) = E\{\min(P(Y \mid X),\ 1 - P(Y \mid X))\}. \tag{2.3}$$

In practical situations, only approximations of the class-conditional
densities $\hat{f}_0 \approx f_0(x) = P(X \mid Y = 0)$ and $\hat{f}_1 \approx f_1(x) = P(X \mid Y = 1)$ are
available. These, together with the approximations of the a priori probabilities
of the classes $\hat{p}_1 \approx p = P(Y = 1)$ and $\hat{p}_0 \approx 1 - p = P(Y = 0)$,
are used to construct the following classifier:

$$g(x) = \begin{cases} 1 & \text{if } \hat{f}_1(x)/\hat{f}_0(x) > \hat{p}_0/\hat{p}_1 \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}$$
For this decision rule, the probability of error is bounded from above [2]
by

$$L(g) - L^* \le \int_{S^d} \left|(1 - p) f_0(x) - \hat{p}_0 \hat{f}_0(x)\right| dx + \int_{S^d} \left|p f_1(x) - \hat{p}_1 \hat{f}_1(x)\right| dx. \tag{2.5}$$

Therefore, if the probability models used fit the data well, and if the data
are representative of the unknown distribution, then the performance of
this approximation is not much different from that of the Bayes classifier.

It is also important to note that if the a priori class probabilities,
$p_1$ and $p_0$, cannot be estimated or are simply assumed equal, then
the classifier of (2.4) turns into a maximum likelihood classifier.
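
As a minimal sketch of the decision rule (2.4), the following Python fragment builds the classifier from estimated class-conditional probabilities and priors. The function name `plug_in_classifier` and the small tables `f1_hat` and `f0_hat` are hypothetical and only serve the illustration; the comparison is written in cross-multiplied form, which is equivalent to (2.4) whenever the denominators are positive.

```python
def plug_in_classifier(f1, f0, p1, p0):
    """Return g(x) implementing rule (2.4) from estimated densities and priors."""
    def g(x):
        # Cross-multiplied form of f1(x)/f0(x) > p0/p1 (avoids a division).
        return 1 if p1 * f1(x) > p0 * f0(x) else 0
    return g

# Hypothetical discrete estimates over the outcomes {0, 1, 2}.
f1_hat = {0: 0.1, 1: 0.3, 2: 0.6}.get
f0_hat = {0: 0.6, 1: 0.3, 2: 0.1}.get

g_map = plug_in_classifier(f1_hat, f0_hat, p1=0.2, p0=0.8)  # unequal priors
g_ml = plug_in_classifier(f1_hat, f0_hat, p1=0.5, p0=0.5)   # equal priors

labels_ml = [g_ml(x) for x in (0, 1, 2)]
labels_map = [g_map(x) for x in (0, 1, 2)]
```

With equal priors the threshold is one, so `g_ml` simply compares the two likelihoods, which is the maximum likelihood classifier mentioned above.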

2. Information Theoretic-Based Learning 

The probability of error $L(g)$ is the fundamental quantity for evaluating
the discrimination capability of the data distribution; however, many
other measures have been suggested and used as the optimization criteria
for learning techniques. The relation between these
discrimination measures and the Bayes error has also been studied. For
example, in support vector machine training [3], an “empirical risk”
function is minimized in order to find the best classifier, and a bound on
the misclassification error has been found.

In the following section, we briefly describe the information-theoretic
discrimination measures that are of particular interest in developing
our learning technique. These have been previously used in different scenarios
such as image registration and parameter estimation [4, 5, 6, 7].
The relations between the Bayes error and these divergence measures
have also been established; for further details on this subject, the
reader is referred to [2, 8].

2.1 Kullback-Leibler divergence 

For the remainder of this chapter, let us consider the random pair
$(X, Y)$ with $Y \in \{0, 1\}$, but let $X \in D^d$ be a d-dimensional vector in
some discrete space $D = \{1, 2, \ldots, M\}$. The Kullback-Leibler divergence
between the two classes is then defined as

$$I = E\left[\eta(x) \log \frac{\eta(x)}{1 - \eta(x)}\right], \tag{2.6}$$

and its symmetric counterpart, Jeffrey’s divergence, is defined as

$$J = E\left[(2\eta(x) - 1) \log \frac{\eta(x)}{1 - \eta(x)}\right], \tag{2.7}$$





Facial Analysis From Continuous Video With Applications 



where $\eta(x) = P(Y = 1 \mid x)$ is the a posteriori probability of the class
$Y = 1$. To understand their meaning, note that both divergences are
zero only when $\eta(x) = 1/2$, and they go to infinity as $\eta(x) \uparrow 1$ and
$\eta(x) \downarrow 0$.

Once again, in the practical situation where the a priori probabilities 
of the classes cannot be estimated and are assumed equal, one uses only 
the class-conditional probabilities P(X|Y = 0) and P(X|Y = 1), and the 
Kullback-Leibler divergence and Jeffrey’s divergence can be computed 
as 



$$H(\lambda) = \sum_{x} \left\{P(x \mid Y = 1) - \lambda P(x \mid Y = 0)\right\} \log \frac{P(x \mid Y = 1)}{P(x \mid Y = 0)}, \tag{2.8}$$

where $\lambda$ is set to zero for the Kullback-Leibler divergence, and set to one
for Jeffrey’s divergence.

This form of divergence is a nonnegative measure of the difference between
these two conditional probabilities. It is zero only when the probabilities
are identical, in which case the probability of error is 1/2, and it
goes to infinity as the probabilities differ more and more from each other, in which case
the probability of error goes to zero. Put differently, it measures the
underlying discrimination capability of the nature of the observations
described by these conditional probability functions.
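
To make (2.8) concrete, here is a short Python sketch that evaluates $H(\lambda)$ for two class-conditional probability tables over the same discrete outcomes. The function name `divergence` and the example tables are hypothetical, and the sketch assumes that every probability is strictly positive so that the logarithm is defined.

```python
import math

def divergence(p1, p0, lam=0.0):
    """H(lambda) of (2.8) for two pmfs given as equal-length lists.

    lam = 0 gives the Kullback-Leibler divergence, lam = 1 Jeffrey's divergence.
    Assumes every entry of p1 and p0 is strictly positive.
    """
    return sum((a - lam * b) * math.log(a / b) for a, b in zip(p1, p0))

# Hypothetical class-conditional pmfs over M = 3 outcomes.
p_obj = [0.7, 0.2, 0.1]   # P(x | Y = 1)
p_bg = [0.1, 0.2, 0.7]    # P(x | Y = 0)

kl = divergence(p_obj, p_bg, lam=0.0)
jeffreys = divergence(p_obj, p_bg, lam=1.0)
```

For identical tables both values are zero, and the $\lambda = 1$ value equals the sum of the two one-sided Kullback-Leibler divergences, which is why it is symmetric in its arguments.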

These information-theoretic discrimination measures are particularly
attractive as the optimization criteria for our learning technique mainly
because of the chain-rule property that they exhibit under
Markov processes. This property allows us to find a practical solution
to the learning problem. In the next section, we explore the Markov
models and the computation of the divergence.

2.2 Nonparametric probability models 

We use nonparametric, or rather discrete, probability models to capture
the nature of the observation vector $x$; recall that $x \in D^d$ is a
d-dimensional vector in some discrete space $D = \{1, 2, \ldots, M\}$. We have
tested several families of models [9, 10]; however, we limit our discussion
here to the one with the most outstanding performance, a modified
Markov model.

Let us formally define a modified, kth-order Markov process with its
probability function

$$P(x_{s_1}, \ldots, x_{s_n} \mid S) = P(x_{s_1}) \prod_{m=2}^{k} P(x_{s_m} \mid x_{s_1}, \ldots, x_{s_{m-1}}) \prod_{m=k+1}^{n} P(x_{s_m} \mid x_{s_{m-k}}, \ldots, x_{s_{m-1}}), \tag{2.9}$$

where $S = \{s_i \in [1, n] : i = 1, 2, \ldots, n;\ s_i \neq s_j\ \forall i \neq j\}$ is a list of indices
used to rearrange the order of the elements of vector $x$. Note that such a
model could be interpreted as a linear transformation, $x' = Tx$, followed
by a regular Markov model applied to the transformed vector $x'$, where

$$T_{ij} = \begin{cases} 1 & \text{if } j = s_i \\ 0 & \text{otherwise.} \end{cases} \tag{2.10}$$
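
As an illustration of (2.9) in the first-order case ($k = 1$), the following Python sketch evaluates the log-probability of an observation under a given index list $S$. The table layout, the 0-based indices and outcome values, and the numbers are all assumptions made for this example.

```python
import itertools
import math

def chain_log_prob(x, order, marginal, conditional):
    """log P(x | S) for a first-order (k = 1) modified Markov model.

    order:       the list S as 0-based indices, a permutation of range(len(x))
    marginal:    marginal[i][v] = P(x_i = v)
    conditional: conditional[(i, j)][u][v] = P(x_i = v | x_j = u)
    """
    first = order[0]
    logp = math.log(marginal[first][x[first]])
    for prev, cur in zip(order, order[1:]):
        logp += math.log(conditional[(cur, prev)][x[prev]][x[cur]])
    return logp

# Hypothetical tables for d = 3 binary dimensions and S = (2, 0, 1).
order = [2, 0, 1]
marginal = {2: [0.4, 0.6]}
conditional = {
    (0, 2): [[0.9, 0.1], [0.2, 0.8]],
    (1, 0): [[0.7, 0.3], [0.5, 0.5]],
}

p_example = math.exp(chain_log_prob((0, 1, 0), order, marginal, conditional))
total = sum(math.exp(chain_log_prob(x, order, marginal, conditional))
            for x in itertools.product(range(2), repeat=3))
```

Only the tables for consecutive pairs in $S$ are consulted, and summing the model over all $2^3$ observations gives one, as expected of a probability function.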

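Given the two conditional probabilities of these modified Markov processes,
$P(x \mid S, Y = 1)$ and $P(x \mid S, Y = 0)$ satisfying (2.9), the divergence
can be efficiently computed as

$$H(S) = H(s_1) + \sum_{m=2}^{k} H(s_m \mid s_1, \ldots, s_{m-1}) + \sum_{m=k+1}^{n} H(s_m \mid s_{m-k}, \ldots, s_{m-1}), \tag{2.11}$$

where, written for the Kullback-Leibler case ($\lambda = 0$),

$$H(i) = \sum_{x_i=1}^{M} P(x_i \mid Y = 1) \log \frac{P(x_i \mid Y = 1)}{P(x_i \mid Y = 0)}, \tag{2.12}$$

and

$$H(i \mid j_1, \ldots, j_k) = \sum_{x_i=1}^{M} \sum_{x_{j_1}=1}^{M} \cdots \sum_{x_{j_k}=1}^{M} P(x_i, x_{j_1}, \ldots, x_{j_k} \mid Y = 1) \log \frac{P(x_i \mid x_{j_1}, \ldots, x_{j_k}, Y = 1)}{P(x_i \mid x_{j_1}, \ldots, x_{j_k}, Y = 0)}. \tag{2.13}$$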



2.3 Learning procedure 

Based on the previously described modified Markov models and the 
information theoretic discrimination measurements, we set the goal of 
the learning procedure: to find the best possible classifier for the given 
set of training examples. 
Let $g(x, S)$ be a maximum likelihood classifier such as that in (2.4),
but recall that $x$ is a d-dimensional vector in some discrete space $D$:

$$g(x, S) = \begin{cases} 1 & \text{if } L(x \mid S) = P_1(x \mid S)/P_0(x \mid S) > L_{th} \\ 0 & \text{otherwise,} \end{cases} \tag{2.14}$$

where $P_1(x \mid S)$ and $P_0(x \mid S)$ are estimations of the first-order modified
Markov class-conditional probabilities $P(x \mid S, Y = 1)$ and $P(x \mid S, Y = 0)$,
respectively, and $L_{th}$ is an estimation of the ratio of the a priori class
probabilities, $P(Y = 0)/P(Y = 1)$, as in (2.4).

We use the statistics of the training set, individually and pair-wise on
the dimensions of the space $D$, to compute the probabilities and, with
them, the divergences $H(i)$ and $H(i \mid j)$ for $i, j = 1, 2, \ldots, d$ using (2.12)
and (2.13). Then, we set up the optimization procedure

$$S^* = \arg\max_{S} H(S), \tag{2.15}$$

to find the sequence $S^*$ that maximizes the divergence

$$H(S) = H(s_1) + \sum_{m=2}^{n} H(s_m \mid s_{m-1}). \tag{2.16}$$
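
The pair-wise statistics and the objective (2.16) can be sketched in Python as follows. This is only an illustration under stated assumptions: outcomes are coded 0 to M-1, the probabilities are estimated from counts with additive smoothing (the `eps` parameter is our own choice so that no logarithm of zero occurs), the divergence is taken in its Kullback-Leibler form, and all function names are hypothetical.

```python
import math
from collections import Counter

def joint_pmf(samples, i, j, M, eps=1.0):
    """Smoothed estimate of P(x_i = a, x_j = b) as an M x M nested list."""
    counts = Counter((x[i], x[j]) for x in samples)
    total = len(samples) + eps * M * M
    return [[(counts[(a, b)] + eps) / total for b in range(M)] for a in range(M)]

def h_pair(pos, neg, i, j, M):
    """H(i | j): first-order, Kullback-Leibler form of the pair-wise divergence."""
    p1, p0 = joint_pmf(pos, i, j, M), joint_pmf(neg, i, j, M)
    h = 0.0
    for b in range(M):
        m1 = sum(p1[a][b] for a in range(M))  # P(x_j = b | Y = 1)
        m0 = sum(p0[a][b] for a in range(M))  # P(x_j = b | Y = 0)
        for a in range(M):
            h += p1[a][b] * math.log((p1[a][b] / m1) / (p0[a][b] / m0))
    return h

def chain_divergence(order, h_first, h_cond):
    """H(S) of (2.16); h_first[i] = H(i), h_cond[(i, j)] = H(i | j)."""
    return h_first[order[0]] + sum(
        h_cond[(cur, prev)] for prev, cur in zip(order, order[1:]))

# Hypothetical training sets: the two dimensions agree in the object class only.
pos = [(0, 0), (1, 1), (0, 0), (1, 1)]
neg = [(0, 0), (0, 1), (1, 0), (1, 1)]
h01 = h_pair(pos, neg, 0, 1, M=2)

h_total = chain_divergence([0, 1, 2], {0: 0.5}, {(1, 0): 0.9, (2, 1): 0.8})
```

Each `h_pair` value needs only the pair-wise histogram of two dimensions, so the whole table of divergences can be filled in after a single pass over the training data.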



If $H(s_m \mid s_{m-1})$ is thought of as a distance from vertex $s_m$ to $s_{m-1}$ and
$H(s_1)$ as a distance from a fixed starting point (different from any of the n
vertices $1, \ldots, n$) to vertex $s_1$, then maximizing $H(S)$ amounts to finding the
longest path that starts from the fixed starting point and traverses
each and every vertex $1, \ldots, n$ exactly once. The optimal solution
gives the optimal traversing path $s_1, \ldots, s_n$. Figure 2.1 is an illustration.

This is closely related to the NP-complete “Travelling Salesman Problem
(TSP)” in graph theory, where this optimal path is called the Hamiltonian
path [11]. Note that this optimization problem, similarly to the
TSP, in practice would not be solved exhaustively. However, a modified
version of Kruskal’s algorithm for the minimum-weight spanning tree
[11, 12] has been shown to obtain sub-optimal, but very good results.

We illustrate the modified Kruskal’s algorithm for the minimum-weight spanning
tree using an example. In Figure 2.2, the left-hand side shows the
fully connected graph; this can be thought of as an example of the graph
connected by solid-line arrows in Figure 2.1. The numbers, i.e. the
weights, along each arrow are the relative entropies between the two random
variables, from the starting node to the ending node. The edges
are considered in decreasing order of weight. The heaviest edge
is chosen first. This is shown on the right-hand side of Figure 2.2.

Figure 2.1. Finding the optimal path starting from the fixed node S and traversing
each and every node exactly once. Dashed-line arrows always start from node
S. Solid-line arrows are bi-directional. Each arrow is associated with a weight (i.e.,
distance). The sub-graph on the left with nodes $V_1, \ldots, V_n$ is fully connected by the
solid-line arrows. The one on the right is an illustration of the optimal path.

Figure 2.2. A demonstration of Kruskal’s algorithm. Left: the fully connected
graph. Right: the heaviest edge is chosen, as shown by the thick solid line.

After the first edge is chosen, all those edges sharing the same starting
node or the ending node of the first edge are removed from the graph, as
they would violate the constraint that the path be a chain. Similarly, the
next two edges are chosen, as shown in Figure 2.3.

On the left-hand side of Figure 2.4, the edge in segmented solid line is
not chosen because otherwise a cycle would be formed, once again violating
the chain constraint. Instead, the next heaviest edge is chosen, as shown
on the right-hand side of Figure 2.4.

The left-hand side of Figure 2.5 selects the next edge as usual. The
segmented solid line on the right-hand side of Figure 2.5 is not chosen either,
since a full chain has already been found, starting from node S in Figure 2.1
and traversing the path in solid line on the right-hand side of Figure 2.5.

Figure 2.4. Left: Step 4. Right: Step 5

Figure 2.5. Left: Step 6. Right: Step 7

Note that in practice, because edges are selected in order of
magnitude, we will obtain a set of short chains instead of a single long
chain traversing all the nodes.
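
The edge selection walked through in the figures can be sketched in Python as below. This is one possible reading rather than the exact procedure of the text: the fixed starting node is treated as an extra node with outgoing edges only, a union-find structure is our own choice for the cycle test, and the loop keeps accepting edges until the short chains have been joined into a single ordering. The function name and the weights in the example are hypothetical.

```python
def greedy_chain(h_first, h_cond):
    """Greedy, Kruskal-like search for an ordering S with large H(S).

    h_first[i]     = H(i), weight of the edge from the start node to i
    h_cond[(i, j)] = H(i | j), weight of the directed edge j -> i
    """
    START = "start"  # hypothetical label for the fixed starting node
    edges = [(w, START, i) for i, w in h_first.items()]
    edges += [(w, j, i) for (i, j), w in h_cond.items()]
    edges.sort(key=lambda e: e[0], reverse=True)  # heaviest edge first

    nxt, has_pred, parent = {}, set(), {}

    def find(a):
        # Representative of the chain that currently contains node a.
        while parent.get(a, a) != a:
            a = parent[a]
        return a

    for _, a, b in edges:
        if a in nxt or b in has_pred:
            continue  # a already has a successor or b a predecessor: not a chain
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # the edge would close a cycle
        parent[ra] = rb
        nxt[a] = b
        has_pred.add(b)

    order, node = [], START
    while node in nxt:
        node = nxt[node]
        order.append(node)
    return order

h_first = {0: 0.5, 1: 0.2, 2: 0.1}
h_cond = {(1, 0): 0.9, (0, 1): 0.3, (2, 0): 0.4,
          (0, 2): 0.1, (2, 1): 0.8, (1, 2): 0.2}
best = greedy_chain(h_first, h_cond)
```

In this small example the greedy ordering happens to coincide with the exhaustive optimum; in general the result is only sub-optimal, as noted above.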
Once a suboptimal solution $S'$ is found, the classifier $g(x, S')$ is implemented
by a look-up table that holds the logarithms of the likelihood
ratios $\log L_{s_1}$ and $\log L_{s_i \mid s_{i-1}}$ for $i = 2, \ldots, d$ so that, given an observation
vector $x$, its log-likelihood ratio is computed as

$$\log L(x \mid S') = \log L_{s_1}(x_{s_1}) + \sum_{m=2}^{n} \log L_{s_m \mid s_{m-1}}(x_{s_m} \mid x_{s_{m-1}}). \tag{2.17}$$
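
A look-up-table evaluation of (2.17) might look like the following Python sketch. The table layout and the integer entries, which stand in for fixed-point log-likelihood ratios, are hypothetical; indices and outcome values are 0-based.

```python
def log_likelihood_ratio(x, order, lut_first, lut_pair):
    """(2.17): one table look-up per dimension, combined by additions.

    lut_first[v]      = log L_{s_1}(v)
    lut_pair[m][u][v] = log L_{s_m | s_{m-1}}(v | u), for m = 1 .. len(order) - 1
    """
    total = lut_first[x[order[0]]]
    for m in range(1, len(order)):
        total += lut_pair[m][x[order[m - 1]]][x[order[m]]]
    return total

def classify(x, order, lut_first, lut_pair, log_threshold=0):
    """Rule (2.14) in the log domain: 1 if log L(x | S') exceeds log L_th."""
    return 1 if log_likelihood_ratio(x, order, lut_first, lut_pair) > log_threshold else 0

# Hypothetical fixed-point tables for d = 3 binary dimensions, S' = (2, 0, 1).
order = [2, 0, 1]
lut_first = [-100, 120]
lut_pair = {1: [[50, -50], [-30, 30]],
            2: [[10, -10], [-80, 80]]}

score_a = log_likelihood_ratio((1, 1, 1), order, lut_first, lut_pair)
score_b = log_likelihood_ratio((0, 1, 0), order, lut_first, lut_pair)
```

Because the tables already hold logarithms, classification involves no multiplications or divisions at test time.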

2.4 Error bootstrapping 

Consider the problem of object detection as that of a two-class clas- 
sification. One of the classes, the object-class Y = 1 , corresponds to 
the object in question, while the other, the background class Y = 0, 
corresponds to the rest of the space. For a given set of observations, 
the classifier is used to decide which of them corresponds to the desired 
object, or to put it another way, the classifier detects the object in the 
observations. 

In this scenario, the object-class can be well represented by the train- 
ing examples; however, that is not the case with the background class. 
One tries to use as many and as diverse examples as possible to estimate 
the conditional probability of the background class. Doing so might 
cause the contribution of the background examples close to the object 
examples to be unfavorably weighted, resulting in a large probability of 
false-detection error for this class of observations. 

One widely used approach to overcome this limitation is called error
bootstrapping. The classifier is first trained with all the examples in
the training set. Then, it is used to classify the training set itself, and the
examples are divided according to the success of this classification. Finally,
the samples that were not successfully recognized are used separately in a
second stage to reinforce the learning procedure.

Information-based maximum discrimination learning can also benefit 
from error bootstrapping. Once the classifier is obtained with all the 
examples of the training set, and the training set has been evaluated with 
this classifier, statistics of the correctly classified examples are computed 
separately from those of the incorrectly classified examples. The new 
class-conditional probabilities are computed as 

$$\begin{aligned} P(x \mid \alpha, Y = 1) &= \alpha P^i(x \mid Y = 1) + (1 - \alpha) P^c(x \mid Y = 1) \\ P(x \mid \beta, Y = 0) &= \beta P^i(x \mid Y = 0) + (1 - \beta) P^c(x \mid Y = 0), \end{aligned} \tag{2.18}$$

where $P^c$ is the probability of the correctly classified examples, $P^i$ is the
probability of the incorrectly classified examples, and $\alpha$ and $\beta$ are the
mixing factors for each of the classes. Using these mixtures and (2.8),
the divergence can be computed as

$$H(\lambda, \alpha, \beta) = \sum_{x} \left[P(x \mid \alpha, Y = 1) - \lambda P(x \mid \beta, Y = 0)\right] \log \frac{P(x \mid \alpha, Y = 1)}{P(x \mid \beta, Y = 0)}. \tag{2.19}$$

Then, the new classifier is obtained by solving the maximization in
(2.15).
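
The mixtures of (2.18) and the divergence of (2.19) can be illustrated with a few lines of Python. The probability tables and mixing factors below are hypothetical, and the sketch assumes strictly positive probabilities.

```python
import math

def mix(p_incorrect, p_correct, weight):
    """(2.18): weight * P^i + (1 - weight) * P^c, element-wise on two pmfs."""
    return [weight * pi + (1.0 - weight) * pc
            for pi, pc in zip(p_incorrect, p_correct)]

def divergence(p1, p0, lam=0.0):
    """(2.19) once p1 and p0 are the mixed class-conditional pmfs."""
    return sum((a - lam * b) * math.log(a / b) for a, b in zip(p1, p0))

# Hypothetical statistics of correctly / incorrectly classified examples.
obj_correct, obj_incorrect = [0.7, 0.2, 0.1], [0.3, 0.4, 0.3]
bg_correct, bg_incorrect = [0.1, 0.2, 0.7], [0.3, 0.3, 0.4]
alpha, beta = 0.3, 0.5  # hypothetical mixing factors

p_obj = mix(obj_incorrect, obj_correct, alpha)
p_bg = mix(bg_incorrect, bg_correct, beta)
h_boot = divergence(p_obj, p_bg, lam=0.0)
```

Raising a mixing factor gives more weight to the examples that the first-stage classifier got wrong, which is the reinforcement that error bootstrapping is after.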

3. Fast Classification 

The learning technique previously described was developed in the context
of discrete observation data, and due to implementation limitations,
the number of outcomes $M$ in the discrete space $D = \{1, 2, \ldots, M\}$ cannot
be large. In this section, we discuss these limitations as well as the
computational requirements of both learning and testing procedures.

In order to hold the statistics, or the histogram of occurrence, of
each pair of dimensions of the observations for each set of training data,
$M^2 d^2$ parameters are required. Only one pass through the training data
is required to capture the statistics, and the processing requirements for
computing the divergence and finding the best sequence $S$ are negligible.
Consequently, the training procedure is incremental; if more data is
added to the training set, such as in cases of adaptation, the training
procedure does not need to be started from scratch.

Once the classifier has been trained, only $d(M^2 + 1)$ parameters are
needed to store its knowledge. Since only the logarithms of the
likelihood ratio functions are needed, and their range of variation is
limited, fixed-point parameters can be used. It is also important to
mention that only d fixed-point additions are required to perform the
classification of an observation; that is, only one operation per dimension
of the observation vectors is required to classify these observations.
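
The incremental nature of the training statistics can be sketched as follows. The class `PairHistogram` is a hypothetical helper, not part of the system described here; it stores the pair-wise histogram sparsely, so it holds at most $M^2 d^2$ counters.

```python
from collections import Counter

class PairHistogram:
    """Occurrence counts of (x_i, x_j) for every ordered pair of dimensions."""

    def __init__(self, d):
        self.d = d
        self.counts = Counter()
        self.n = 0

    def update(self, samples):
        # One pass over the new samples; earlier counts are simply kept.
        for x in samples:
            for i in range(self.d):
                for j in range(self.d):
                    self.counts[(i, j, x[i], x[j])] += 1
            self.n += 1

# Hypothetical binary observations with d = 3.
batch_a = [(0, 1, 1), (1, 0, 1)]
batch_b = [(1, 1, 0)]

incremental = PairHistogram(3)
incremental.update(batch_a)
incremental.update(batch_b)      # adaptation: add data without restarting

from_scratch = PairHistogram(3)
from_scratch.update(batch_a + batch_b)
```

Updating with a new batch yields the same statistics as retraining on the whole set, which is what makes adaptation inexpensive.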

Chapter 3 will address how to preprocess the raw data to produce 
discrete observation vectors so that this learning technique can be used 
in different scenarios. The problem is discussed in the context of its 
application to face and facial feature detection. 




Chapter 3 



FACE AND FACIAL FEATURE 
DETECTION 



Visual detection of patterns is a problem of significant importance
and difficulty. Automatic detection of targets is the first step in most
automatic vision systems. In other fields, such as content-based retrieval
and model-based coding, robust and fast detection algorithms are also
needed. If no motion or color information is available, and no previous
knowledge about the desired object can be assumed, such as the size,
pose, and number of instances in a scene, the target detection problem
can become extremely difficult, if not impossible, because of the intense
computation required.

Object detection can be treated as a particular case of classification in 
which only two classes are dealt with: the class of objects to be detected 
and the class of all other background objects. Given a set of observation 
examples of both of these classes, the learning procedure is the selection 
of the best possible discriminant function that separates the observations 
according to their classes. Then, these discriminant functions are used 
for object detection. 

In the case of face detection, the classification between faces and back- 
ground objects is difficult mostly because of the diversity of variations in 
the face class such as racial characteristics, facial expressions, hair style, 
make-up, glasses, beard, etc. Similarly, the location of facial features is 
difficult because of variations among individuals (races, make-up styles, 
facial expressions, etc.). Additionally, there are other sources of varia- 
tions in the image formation process, such as illumination and head pose. 
Practical solutions are found by constraining the sources of variation and 
limiting the use of these algorithms to particular scenarios. 

In Section 1, we briefly review the approaches used to detect faces
and facial features. Section 2 describes in detail our technique for face
detection in complex backgrounds and for facial feature detection. Finally,
in Section 3, we discuss other issues of importance regarding our
implementation.

1. Previous Approaches 

Many methods have been used for face detection. Here, we classify 
them based on the approach taken to deal with the invariance to scale 
and rotation. Invariance to scale and rotation in face detection has 
been addressed in three different ways. First, a bottom-up approach 
uses feature-based geometrical constraints [13]. Facial features are found 
using spatial filters, then they are combined to formulate hypothetical 
face candidates which are validated using geometrical constraints. 

Second, scale- and rotation-invariant face detection has been pro- 
posed using multi-resolution segmentation. Face candidates are found 
by merging regions until their shapes approximate ellipses. Next, these 
candidates are validated using facial features [14, 15, 16]. Skin color 
segmentation has been successfully used to segment faces in complex 
backgrounds. 

Last, most face detection algorithms use multi-scale searches with 
classifiers of fixed size. These classifiers are trained from examples using 
several learning techniques. Multi-scale detection is particularly useful 
when no color information is available for segmentation and when the 
faces in the test images are too small to use facial features. 

Most pattern recognition techniques have been used with multi-scale
detection schemes for face detection. An early approach uses decision
rules based on image intensity to formulate face candidates from a multi-scale
representation of the input images. Then, these candidates are
validated with edge-based facial features [17]. Maximum likelihood classifiers
based on Gaussian models applied on a principal component subspace
have also been studied [18] in the context of face detection. Similarly,
support vector machine classifiers have been successfully used [19]
for face detection in complex backgrounds.

Neural-network-based systems have also been successfully used for
multi-scale face detection [20, 21, 22]. More recently, a neural-network-based,
rotation-invariant approach to face detection has been proposed
[23]. In a multi-scale detection setup, sub-windows are first tested with
a neural network trained to return the best orientation, and then the
face detector is applied only at the given orientation.

While the use of color images has proven to be very helpful for detecting
skin, the general approach of segmentation and region merging
is limited to producing a rough location of the face with almost no detail
whatsoever about the features. On the other hand, feature-based face
detection techniques rely on feature detection, which in principle is
as difficult as the original problem of face detection; consequently, these
techniques are invariant to rotation and scale only within the range in
which the feature detectors are invariant to rotation and scale.

Schemes for face detection that use example-based pattern recognition
techniques and that search at different scales and rotation angles are
limited mostly by the computational complexity required to deal with
large search spaces. However, the accuracy and performance of example-based
recognition techniques allow these systems to locate facial features.

The face and facial feature detection algorithm described in this chap- 
ter carries out a multi-scale search with a face classifier based on the 
learning technique described in the previous chapter. Then, using fa- 
cial feature classifiers, nine facial features are located at positions where 
faces were found. Finally, candidate validation is further carried out by 
combining the confidence level of individual feature detection with that 
of the face detection. 

2. IBMD Face and Facial Feature Detection 

In this section, we describe a face and facial feature detection algo- 
rithm that uses the information-based maximum discrimination classi- 
fiers presented in the previous chapter. Since these classifiers can deal 
with small amounts of rotation and scale variation, we use a multi-scale 
detection setup to handle faces at a wider range of scale. 

2.1 Multi-scale detection of faces 

Figure 3.1 illustrates the scheme for face detection based on multi- 
scale search with a classifier of fixed size. First, a pyramid of multiple 
resolution versions of the input image is obtained. Then, a face classifier 
is used to test each sub-window in a subset of these images. At each 
scale, faces are detected depending on the output of the classifier. The 
detection results at each scale are projected back to the input image 
with the appropriate size and position. One chooses the scales to be 
tested depending on the desired range of size variation allowed in the 
test image. 
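
As a rough illustration of the search just described, the following Python sketch builds a small pyramid by nearest-neighbour resampling, slides a fixed-size window over each level, and projects the accepted windows back to input-image coordinates. The function names, the scale factors, the threshold and the `score` callable (a stand-in for the face classifier) are assumptions made for this example only. The default window is the 16 x 14 sub-window used later in this section, taken here as 16 rows by 14 columns, which the text does not state.

```python
def downscale(image, factor):
    """Nearest-neighbour resampling of a list-of-rows image; factor > 1 shrinks it."""
    h, w = len(image), len(image[0])
    new_h, new_w = int(h / factor), int(w / factor)
    return [[image[int(r * factor)][int(c * factor)] for c in range(new_w)]
            for r in range(new_h)]


def multiscale_search(image, score, win_h=16, win_w=14,
                      factors=(1.0, 1.25, 1.5625), threshold=0.0):
    """Test every sub-window at every pyramid level with a fixed-size classifier.

    `score` is a hypothetical callable returning a confidence for one window.
    Returns (row, col, height, width, score) tuples in input-image coordinates.
    """
    detections = []
    for f in factors:
        level = downscale(image, f)
        if len(level) < win_h:
            continue  # this level is smaller than the classifier window
        for r in range(len(level) - win_h + 1):
            for c in range(len(level[0]) - win_w + 1):
                window = [row[c:c + win_w] for row in level[r:r + win_h]]
                s = score(window)
                if s > threshold:
                    # Project the window back to the input image.
                    detections.append((round(r * f), round(c * f),
                                       round(win_h * f), round(win_w * f), s))
    return detections
```

A window accepted at a coarse level maps back to a proportionally larger box in the input image, which is how a classifier of fixed size can cover a range of face sizes.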

Each window is preprocessed to deal with illumination variations be- 
fore it is tested with the classifier. A postprocessing algorithm is also 
used for face candidate selection. The classification is not carried out 
independently on each sub-window. Instead, face candidates are
selected by analyzing the confidence level of sub-window classifications in
neighborhoods so that results from neighbor locations, including those
at different scales, are combined to produce a more robust list of face
candidates.

Figure 3.1. General scheme for multi-scale face detection (stages shown: preprocessing, classification, postprocessing)

In our implementation, we use a face classifier to test sub-windows 
of 16 x 14 pixels. The preprocessing algorithm consists of a histogram 
equalization and a re-quantization procedure so that four grey levels are 
used to feed the classifier. Examples of face images preprocessed at four 
grey levels are shown in Figure 3.2(c). The output of the classifier is the 
log likelihood of the observations. The greater this value, the more likely 
the observation corresponds to a face. For each sub-window tested, a
log-likelihood map is obtained with the output of the function L(x|S)
of the classifier in (2.14). The face classification decision is not made
independently at every sub-window; instead, face candidates are selected
by analyzing these log-likelihood maps locally.

Figure 3.2. Examples of the image preprocessing algorithm (panel (c): requantized images)
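
The preprocessing step described above (histogram equalization followed by re-quantization to four grey levels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 8-bit input range and the choice of equal-population bins are assumptions.

```python
def equalize_and_requantize(window, levels=4, max_value=255):
    """Histogram-equalize a grey-level window and map it to `levels` values.

    `window` is a list of rows of integers in 0..max_value. Each pixel is
    replaced by the quantized rank of its grey level within the window, so
    the output uses the values 0..levels-1.
    """
    pixels = [p for row in window for p in row]
    n = len(pixels)
    hist = [0] * (max_value + 1)
    for p in pixels:
        hist[p] += 1

    lookup, cumulative = [], 0
    for count in hist:
        cumulative += count
        # ceil(cumulative * levels / n) - 1 in integer arithmetic, floored at 0
        lookup.append(max(0, -(-cumulative * levels // n) - 1))
    return [[lookup[p] for p in row] for row in window]
```

For example, `equalize_and_requantize([[0, 50], [100, 150]])` gives `[[0, 1], [2, 3]]`: four equally frequent grey levels are spread over the four output symbols regardless of their absolute brightness.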

The face detection classifier was trained with a subset of 703 images 
from the FERET database [24, 25]. Faces were normalized with the location
of the outer eye corners. Three rotation angles θ = {5.0, 0.0, −5.0}
and three scale factors s = {1.000, 0.902, 0.805} were used on each image
to produce a total of 6327 examples of faces. Figure 3.2(c) shows one
example of the images in the FERET database, the scale and rotation
normalized images, and the four-grey-level training patterns. On the
other hand, 44 images of a variety of scenes and their corresponding 
scaled versions were used to produce 423,987 examples of background 
patterns. 
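
The size of the face training set follows directly from the figures above: each of the 703 images is used with every combination of rotation angle and scale factor. The short check below only reproduces that arithmetic; the variable names are ours.

```python
from itertools import product

rotations = (5.0, 0.0, -5.0)     # rotation angles quoted in the text
scales = (1.000, 0.902, 0.805)   # scale factors quoted in the text
num_images = 703                 # FERET subset size quoted in the text

# Every image is paired with every (rotation, scale) combination.
variants = list(product(rotations, scales))
num_face_examples = num_images * len(variants)
```

With nine variants per image, `num_face_examples` equals the 6327 examples stated in the text.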

We used this training data to compute the divergence of each pixel 
of the 224 = 16 x 14 element observation vector. Figure 3.3(a) is a
16² x 14² image that shows the divergence of each pair of pixels as
in (2.13). Similarly, Figure 3.3(b) shows the divergence of each pixel
independently; it is computed as in (2.12).

Figure 3.3. Divergence of the training data: (a) pair of pixels, (b) independent pixels, (c) best sequence, (d) bootstrapped sequence

We sub-optimally solved the maximization in (2.15) using a greedy 
algorithm and obtained a sequence of index pixels with high divergence. 
Figures 3.3(c) and 3.3(d) show the divergence of the pixels in the se- 
quence found by our learning algorithm before and after error boot- 
strapping. Although the sequence itself cannot be visualized from these 
images, they show the divergence of the facial regions. Note that the 
eyes, cheeks and nose are the most discriminative regions of faces. 
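
Equations (2.12) to (2.15) belong to the previous chapter and are not reproduced here, so the sketch below only illustrates the general idea of a greedy search under stated assumptions. The divergence of a single pixel is taken to be the Kullback-Leibler divergence between its face and background distributions, and `pair_div[i][j]` is a hypothetical table holding the divergence gained by placing pixel `j` immediately after pixel `i` in the sequence. The actual objective in (2.15) may differ.

```python
import math


def kl_divergence(p, q, floor=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0)


def greedy_sequence(pair_div):
    """Order pixel indices greedily by pairwise divergence (sub-optimal).

    Starts from the ordered pair with the highest divergence, then repeatedly
    appends the unused pixel with the highest divergence given the last one.
    """
    n = len(pair_div)
    first = max(((i, j) for i in range(n) for j in range(n) if i != j),
                key=lambda ij: pair_div[ij[0]][ij[1]])
    sequence = list(first)
    remaining = set(range(n)) - set(sequence)
    while remaining:
        last = sequence[-1]
        best = max(sorted(remaining), key=lambda j: pair_div[last][j])
        sequence.append(best)
        remaining.remove(best)
    return sequence
```

A greedy search of this kind looks only one step ahead, which is why the result is described as sub-optimal: a different starting pair could yield a sequence with higher total divergence.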

2.2 Facial feature detection 

Once the face candidates are found, the facial features are located 
using classifiers trained with examples of facial features. Face detection 
is carried out at a very low resolution with the purpose of speeding up 
the algorithm. Facial features are located by searching with the feature 
detectors at the appropriate positions in a higher resolution image, i.e.,
two or three images up in the multi-scale pyramid of images. 

In an early approach [10, 9], we used the same preprocessing algorithm 
and learning scheme described above to train both the face detector and
the eye-corner detectors. The facial feature detection algorithm
described here is based on a more complex discrete observation space 
that combines edges and pixel intensity. 

This new discrete space is formed by combining the results of three 
low-level image processing techniques: (i) binary quantization of the
intensity, (ii) three-level quantization of horizontal edges, and (iii)
three-level quantization of vertical edges. The threshold values used to
re-quantize these low-level feature images are based on a fixed percentage
of the pixels in a region in the center of the face. Combining these sets
of discrete images, we construct the discrete image $I = I_i + 2 I_v + 6 I_h$
that is used to locate the facial features, where $I_i$, $I_v$ and $I_h$
denote the quantized intensity, vertical-edge and horizontal-edge images.
Figure 3.4 shows three sets of examples of these low-level feature
images. The left column in Figure 3.4 indicates the positions of the
nine facial features used to train the facial feature detectors: the outer
corners of each eye, the four corners of the eyebrows, the center of the
nostrils and the corners of the mouth.

Figure 3.4. Low-level features for discrete observations
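
The combination of the three quantized images can be read as a mixed-radix code: a binary intensity value, a three-level vertical-edge value weighted by 2, and a three-level horizontal-edge value weighted by 6, which gives 2 x 3 x 3 = 18 distinct symbols per pixel. The sketch below assumes the combination $I = I_i + 2 I_v + 6 I_h$ with $I_i$ in {0, 1} and $I_v$, $I_h$ in {0, 1, 2}; the function names are ours.

```python
def combine_features(intensity, vertical, horizontal):
    """Combine three quantized images into one discrete image.

    intensity: values in {0, 1}; vertical, horizontal: values in {0, 1, 2}.
    Each output pixel is I_i + 2*I_v + 6*I_h, a value in 0..17.
    """
    return [[i + 2 * v + 6 * h for i, v, h in zip(ri, rv, rh)]
            for ri, rv, rh in zip(intensity, vertical, horizontal)]


def split_code(code):
    """Recover (I_i, I_v, I_h) from a combined value in 0..17."""
    return code % 2, (code // 2) % 3, code // 6
```

Because the weights 1, 2 and 6 are the running products of the numbers of levels, every combination of the three low-level features maps to a unique symbol and can be recovered with `split_code`.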

We trained the classifiers using 150 images in which the feature po- 
sitions were located by hand. We used three rotation angles and three 
scale factors to produce the image examples of facial features. Negative 
examples were obtained from image sub-windows at neighbor locations 
around the corresponding feature positions. The relative locations of 
the facial features in these training images were also used to determine 
the size and location of the facial feature search areas of the detection 
procedure. 

Based on the individual performance of the feature detectors, we 
implemented a hierarchical facial feature detection scheme. First, the 
nostrils are located and used to center adjust the search areas of the 
other facial features. Then, the other facial features are detected.
Figure 3.5(a) shows a test image and the location of the facial features as
detected by our algorithm. Figure 3.5(b) is the normalized window on
which the preprocessing technique is carried out. Figure 3.5(c) shows
the low-level feature images. Figure 3.5(d) shows the search areas and
the log-likelihood maps of the search of each feature.

Figure 3.5. Example of feature detection

3. Discussion 

In this chapter, we have described an application of information-based 
maximum discrimination learning. We used this technique for face and 
facial feature detection via maximum likelihood decision. Image exam- 
ples were used to obtain probability models that maximize the power to 
discriminate images of frontal view faces from those of backgrounds,
and of facial features from those of their neighbor regions. These models
were used in a scale-invariant scheme for detection of faces and facial
features. 

One of the most important issues in applying information-based
maximum discrimination learning is the technique used for image
preprocessing and re-quantization. This is important because of the mapping
between the input image space and the discrete image space. This mapping
reduces the number of possible pixel values of the discrete image
space while preserving the information useful for object discrimination 
so that discrete probability models can be implemented. In this chapter, 
we have described two approaches. Four-grey-level intensity was used 
for face detection while a combination of edges and intensity was used 
for facial feature detection. 

Another important issue in visual object detection is the choice of 
size of the image window used for detection. While large windows that 
include most of the desired object result in high detection performance, 
their location accuracy is poor. In our face and facial feature detection
algorithm, we have presented a hierarchical scheme to improve detection
performance and location accuracy. First, faces are detected using
low-resolution image sub-windows that include most of the face. Then, the
facial features are detected using image sub-windows at higher resolution
that include only a portion of the facial feature being detected. 




Chapter 4 



FACE AND FACIAL FEATURE 
TRACKING 



The problem of tracking the face and other body parts in video se- 
quences has become a subject of research due to emerging applications 
in human-computer interface, surveillance, model-based video coding, 
computer games, and other fields. In this chapter, we present in detail 
an automatic, real-time system for detection and tracking of multiple 
faces and nine facial features.

Based on the output of this tracker, we have also developed an al- 
gorithm for analyzing the global position of faces. This can be used to 
detect shaking and nodding of the head, different viewing scenarios (such 
as near/distant, upright, lying down) and periods of high and low user 
activity. This analysis is useful to sense not only explicit events such 
as yes- and no-like gestures, but many other clues to the user’s state of 
mind. These could potentially improve machine interfaces by adapting 
their behaviors in response to the user’s conduct. 

In Sections 1 and 2, we present a general description of tracking sys- 
tems and a brief review of previous approaches for face tracking. Next, 
in Section 3, we describe in detail our face and facial feature tracking
system. Finally, Section 4 describes our algorithm for the analysis of the 
global head position. 

1. Visual Object Tracking and Motion Analysis 

Visual object tracking and motion analysis from video sequences are 
areas of great importance in computer vision. The former addresses 
the problem of locking onto the object in question in an image sequence 
despite its changes in pose, size, illumination, and even appearance. The 
latter, motion analysis, is concerned with estimation of nonrigid motion 
within the parts of the object being tracked.

Figure 4.1. General scheme for visual tracking and motion analysis



There are a number of approaches to object tracking, but their appli- 
cations are limited because they neglect nonrigid motion within object 
parts. In more complex tracking systems, object tracking and motion 
analysis cooperate to produce more complex representations of the mo- 
tion of the tracked object and to improve the robustness of the system 
against large appearance changes of the object in question. 

Figure 4.1 illustrates a general scheme for object tracking and motion 
analysis. Note that we have highlighted two different loops: (i) the 
tracking loop, which executes at the frame rate of the input video, and 
(ii) the initialization loop, which executes only at the beginning or when 
the confidence level of the tracking loop drops below some acceptable
level. 

The tracking loop is further divided into two steps: feature matching
and model fitting. The feature matching step is responsible for locat- 
ing object features using image processing and pattern recognition al- 
gorithms, while the model fitting step imposes geometrical constraints 
and combines the individual results of the feature locations based on the 
knowledge provided by the object model. 

In most approaches, the initialization procedure is not automatic. 
The locations of the features are specified by hand, or the user interacts 
with the system to position the features in preset locations. Then, the 
geometry of the model as well as the data used in the feature matching 
step are adjusted to perform the tracking in the subsequent frames. 

In the next section, we compare several tracking approaches on the 
basis of this general scheme. The following sections describe our face and 
facial feature detection system and one application in human-computer 
interface.

2. Previous Approaches

In early tracking systems [26, 27, 28], the feature matching step was 
carried out from one frame to the next using optical flow computations, 
resulting in drifting errors accumulating over long image sequences. In 
later techniques, feature texture information is gathered during initial- 
ization, so the feature matching step is carried out with respect to the 
initialization frame to overcome drifting. 

In order to deal with large out-of-plane rotations, a 3D model of the 
geometry of the face has been used together with the texture obtained 
from the initialization step to achieve 3D pose estimation simultaneously 
with face tracking in an analysis-by-synthesis scheme [29, 30]. In this 
approach, the 3D model is used to create the templates by rendering the 
texture given the head pose so that the feature matching step performs 
well on large out-of-plane rotations. However, this system requires the 
3D model of the person’s head/face. 

A wire-frame model capable of nonrigid motion has also been used to 
analyze facial expressions together with the global position of the face 
[31]; however, the templates used in the feature matching algorithm do 
not adapt according to the nonrigid deformation or global head position, 
resulting in poor accuracy on extreme expressions and large out-of-plane 
rotations when the templates and the input images do not match well. 
In this approach, a piece-wise linear deformation model is used to con- 
strain the non-rigid motion into a subspace of deformations established 
beforehand. In a more complex scheme [32], optical flow constraints are 
used together with a wire-frame model to track rigid and nonrigid mo- 
tion and adapt the wire-frame model to fit the person’s head. One of 
the most serious limitations in the wire-frame approaches is the fitting 
of the wire-frame model to the face in the initialization frame; this task 
involves the accurate location of many facial feature points and is carried
out by hand. 

Other approaches of current interest are those based on “blobs,” where 
the face and other body parts are modelled with 2D or 3D Gaussian 
distributions of pixels. Pixels are clustered by their intensity [33] or 
color [34], or even by disparity maps from stereo images [35]. Although 
these techniques fail to capture nonrigid facial motion, they are easily 
initialized and operate very efficiently, even in sequences with
moderate occlusion. 

In general, algorithms that use complex wire-frame models provide 
a framework for high-level motion analysis of nonrigid facial motion. 
However, these complex models need to be customized to the face be- 
ing tracked during a similarly complex initialization procedure. At the 
other end of the spectrum, algorithms based on simple models, such as
blobs, have proven to be feasible. Their simple initialization procedures
and low computational requirements allow them to run in real-time on 
portable computers, but they are limited in the amount of information 
they extract from the object parts. 

The next step would be to combine these two schemes in a hierarchi- 
cal approach that would benefit from both; however, the gap between 
the two schemes is too wide to be bridged since the complex models 
still must be initialized with person-independent accurate location of fa- 
cial features. The face and facial feature tracking algorithm described 
here stands somewhere in between these two schemes. Faces and facial 
features are detected and tracked using person-independent appearance 
and geometry models that can be easily initialized and efficiently imple- 
mented to perform in real time. Nine facial features of multiple people 
are tracked; these account for global positions of the heads as well as for 
nonrigid facial deformations. 

3. Face and Facial Feature Tracking 

In this section, we describe in detail a face and facial feature tracking 
system based on information-based maximum discrimination classifiers. 
These facial feature classifiers are used not only in the initialization 
step, but also in the feature matching step in the main tracking loop. 
The tracking scheme is invariant to rotation and scale so that once the 
tracker locks onto the faces, they are allowed to get close to or far from 
the camera and to rotate freely. 

The input to the system is continuous video. In addition to providing 
a log of the time when people step in and out of the field of vision and the 
location of the faces and nine facial features as they are being tracked, 
the system performs temporal analysis of the global position of the faces 
to detect head shaking and nodding.

3.1 Feature matching 

Feature matching is carried out using maximum likelihood decisions 
and our information-based maximum discrimination classifiers. Since 
the classifiers are trained to perform within a limited range of variation
in scale and rotation, we developed a scheme for rotation- and scale- 
invariant tracking. Using the predicted positions and sizes of the faces 
obtained from those of the previous frames, the positions and sizes of 
the searching areas are adjusted on every frame. 

As illustrated in Figure 4.2, the predicted position, orientation and 
size of a face are used to project a region of the input frame into an image
where the face is normalized in size and rotation. Feature matching is
performed in this image. The locations of the features are then projected
back to the input frame. We use bilinear interpolation to obtain this
normalized image and allow the scaling factor to be both greater than
and less than one so that faces can be tracked even when their sizes in
the input frame are less than that required by the classifiers.

Figure 4.2. Hierarchical, scale- and rotation-invariant feature location scheme
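
The projection into a size- and rotation-normalized image can be sketched with a similarity transform and bilinear interpolation. The code below is an illustration under our own assumptions: the function names, the parameterization (`scale` is the ratio of the face size in the frame to its size in the normalized image, `angle` is in radians) and the clamping at the image border are not taken from the text.

```python
import math


def bilinear(image, x, y):
    """Sample a list-of-rows image at real-valued (x, y), clamping at the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def normalize_face(frame, center, angle, scale, out_h, out_w):
    """Resample the region around `center` into an out_h x out_w image.

    Each output pixel is mapped into the frame by rotating its offset from
    the output centre by `angle`, scaling it by `scale`, and adding `center`.
    `scale` may be greater or less than one.
    """
    cx, cy = center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for r in range(out_h):
        v = r - (out_h - 1) / 2.0
        row = []
        for c in range(out_w):
            u = c - (out_w - 1) / 2.0
            x = cx + scale * (cos_a * u - sin_a * v)
            y = cy + scale * (sin_a * u + cos_a * v)
            row.append(bilinear(frame, x, y))
        out.append(row)
    return out
```

Feature positions found in the normalized image can be projected back to the frame with the same mapping that is applied to `(u, v)` inside the loop.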

In order to deal with fast motion from one frame to another, a hi- 
erarchical matching procedure was implemented. At half resolution, a 
classifier first detects the whole face in a search area of size similar to that 
of the face. Then, the rest of the features are located at full resolution 
using much smaller search areas in appropriate positions relative to that 
of the face. It is important to mention that since the face detector per- 
forms very well and quickly locates the face in such a large search area, 
high-order motion predictors were not helpful in this tracking system. 

3.2 Model fitting 

In this tracking system, models of the facial feature appearances are 
captured by the classifiers. Geometrical constraints of the relative po- 
sitions of the facial features are obtained from the training examples. 
They consist of the location and size of the search areas of the feature 
detectors in the aforementioned normalized image. Consequently, the 
feature location accuracy of this system is the same as that of the fea- 
ture classifiers, and the confidence level of the tracking procedure is a 
combination of those of the feature detectors. 

3.3 Initialization 

Ideally, the initialization of the main tracking loop in a system de- 
signed to track one object at a time is the location of the object at 
the first frame. In practice, some constraints are imposed to limit the 
search space. Our detection algorithm used for initialization is based on
the face and facial feature classifiers described in the previous section.
The algorithm searches for faces using frontal views with upright
orientations (less than ±6.0° rotation) and size ranging from 25 to 407 pixels
to locate the eye corners. This pair of position vectors is sufficient to
establish the location, size and rotation angle of the face.

Figure 4.3. Initialization in a real-time tracking system

If real-time operation is not needed, this initialization procedure is 
very simple. Once object detection is completed in one frame, the track- 
ing loop takes over starting with the next frame. For a tracking sys- 
tem to deal with multiple faces and perform in real-time, a much more 
complex initialization scheme is required to deal with detection latency, 
synchronization, etc. 

Figure 4.3 shows a diagram of how our tracking system operates. 
The main tracking loop, indicated as “real-time tracking,” executes syn- 
chronously with the video frame grabber, processing one frame per pe- 
riod. Initialization is needed much less frequently and consists of two 
steps: “object detection” and “behind tracking” (tracking of the frames 
left behind). 

Assume that at time t = 0, the initialization procedure is started with 
an execution of the face detection algorithm. By time $t_1$, when the face
detection is completed, a total of $N = t_1 / T_{video}$ frames have passed and
the face might have moved too far from its original position at frame
t = 0 for tracking to work. During the time period indicated as “behind
tracking,” the tracking algorithm executes asynchronously and as fast
as possible, starting from the frame t = 1 until it catches up with the
video frame grabber. Then, real-time operation resumes. 

In addition to a slightly more complex control unit, this initialization 
procedure requires a buffer to hold the frames that cannot be processed 
in real-time. For this system to work in real-time, the tracking algorithm 
needs to be much faster than the actual video rate. 

The delay between the instant at which a face is detected and the time
when it begins to be tracked in real-time, $T_{delay} = t_2$, can be computed
from

\[
T_{delay} = T_{detection} \, \frac{T_{video}}{T_{video} - T_{tracking}}, \tag{4.1}
\]

where $T_{detection}$ and $T_{tracking}$ are the time required for detection and
tracking in one frame and $T_{video}$ is the period between frames of the
video. One also has to consider that the initialization procedure is not
carried out continuously, so there is an extra delay from the time when
people get into the vision field until the time when face detection is
executed.
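
The catch-up delay can be checked numerically. The sketch below evaluates the closed form T_delay = T_detection * T_video / (T_video - T_tracking) and compares it with a frame-by-frame simulation of the "behind tracking" phase, in which buffered frames are processed back to back while new frames keep arriving. The function names and the timing values in the usage note are illustrative assumptions, not measurements from the system.

```python
def catch_up_delay(t_detection, t_video, t_tracking):
    """Closed-form delay until the tracker catches up with the frame grabber."""
    if t_tracking >= t_video:
        raise ValueError("tracking must be faster than the video rate to catch up")
    return t_detection * t_video / (t_video - t_tracking)


def simulate_delay(t_detection, t_video, t_tracking):
    """Process buffered frames one by one until none are left waiting.

    Frames arrive every t_video seconds from t = 0; detection occupies the
    first t_detection seconds, after which each frame takes t_tracking.
    """
    clock = t_detection
    processed = 0
    while processed < clock / t_video:   # frames that have arrived so far
        clock += t_tracking
        processed += 1
    return clock
```

For instance, with a hypothetical 0.5 s detection time, 30 frames per second video and a tracker three times faster than the video rate, `catch_up_delay(0.5, 1/30, 1/90)` gives 0.75 s, and the simulation agrees to within one frame period. As the tracking time approaches the frame period the denominator goes to zero, which is why the tracker needs to be much faster than the video rate.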

In a system intended to track a single object, the initialization pro- 
cedure can be executed as often as possible while no object is being 
tracked, and it can be turned off once the system locks onto the object. 
Complications emerge when multiple objects are to be tracked. First, 
the initialization procedure should execute continuously unless the max- 
imum number of objects is being successfully tracked. This guarantees 
that new objects entering the vision field are locked onto, even when the 
system is already tracking other objects. However, this can be an enor- 
mous waste of resources, especially if the expected number of objects to 
be tracked is always less than the maximum handled by the system. In 
our system, we force the initialization procedure to execute only once
every $T_{init}$ seconds, and to stop once the maximum number of tracked
faces $N_{faces}$ is reached. Default values of $T_{init} = 1$ s and $N_{faces} = 4$
produce a total latency of less than 2 s and leave plenty of processing power
available in the computer for other tasks. 

A simpler problem is due to the fact that at the end of the face 
detection step, the list of face candidates includes those that are being 
successfully tracked. Therefore, the system has to implement a rejection 
scheme to avoid having the same face tracked more than once in the 
system. Based on the assumption that two people cannot be in the 
same place at the same time, if the position of two faces is very close in 
relation to their sizes, the system rejects the face candidate with a lower 
detection/tracking confidence level.

4. Global Head Position Analysis

We have implemented an algorithm for analysis of the global trajec- 
tory of the face and head. It assumes that the camera position is fixed, 
so that the 2D motion of the face and its features in the video sequence 
reflects the actual motion of the head. The result is the detection of 
head shaking and nodding, moving toward and away from the machine, 
and periods of high and low activity. 

In order to detect head shaking and nodding, a window of 2 s of
frames is analyzed every second to detect sinusoidal motion independently
on each axis. Then, a rule-based system is used to decide whether
the person’s head is shaking, nodding, staying stationary or moving
erratically. The head position is taken to be midway between the outer
corners of the eyes, and motion over the analyzed window is normalized
using the average rotation angle and distance between the outer corners
of the eyes.

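More formally, let $\mathbf{x}_r^i = (x_r^i, y_r^i)$ and $\mathbf{x}_l^i = (x_l^i, y_l^i)$ for $i = 1, \ldots, N$
be the positions of the outer corners of the right and left eyes at the
$i$th frame. We first obtain the face center $\mathbf{x}_i = (\mathbf{x}_l^i + \mathbf{x}_r^i)/2$, size
$d_i = \|\mathbf{x}_l^i - \mathbf{x}_r^i\|$, and rotation angle $\alpha_i = \angle(\mathbf{x}_l^i - \mathbf{x}_r^i)$.

We normalize the position vectors as

\[
\mathbf{x}'_i = \frac{1}{\bar{d}}
\begin{bmatrix} \cos(\bar{\alpha}) & -\sin(\bar{\alpha}) \\ \sin(\bar{\alpha}) & \cos(\bar{\alpha}) \end{bmatrix}
(\mathbf{x}_i - \bar{\mathbf{x}}), \tag{4.2}
\]

where $\bar{\mathbf{x}}$, $\bar{d}$, and $\bar{\alpha}$ are the means of the positions, sizes and rotation
angles of the face over the $N$ frame window.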
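
The normalization in (4.2) can be sketched as follows, reading it as a rotation by the mean angle and a division by the mean size applied to the mean-centred face positions. The function name and the plain arithmetic mean of the angles (which ignores wrap-around at ±π) are assumptions of this example.

```python
import math


def normalize_trajectory(right_eyes, left_eyes):
    """Normalize face-centre positions over a window of frames.

    right_eyes, left_eyes: lists of (x, y) outer eye corner positions, one
    pair per frame. Returns the list of normalized (x, y) face positions.
    """
    n = len(right_eyes)
    pairs = list(zip(right_eyes, left_eyes))
    centers = [((xl + xr) / 2.0, (yl + yr) / 2.0) for (xr, yr), (xl, yl) in pairs]
    sizes = [math.hypot(xl - xr, yl - yr) for (xr, yr), (xl, yl) in pairs]
    angles = [math.atan2(yl - yr, xl - xr) for (xr, yr), (xl, yl) in pairs]

    mean_x = sum(c[0] for c in centers) / n
    mean_y = sum(c[1] for c in centers) / n
    mean_d = sum(sizes) / n
    mean_a = sum(angles) / n   # simple mean; assumes no wrap-around in the window
    cos_a, sin_a = math.cos(mean_a), math.sin(mean_a)

    return [((cos_a * (x - mean_x) - sin_a * (y - mean_y)) / mean_d,
             (sin_a * (x - mean_x) + cos_a * (y - mean_y)) / mean_d)
            for x, y in centers]
```

Dividing by the mean eye distance expresses the motion in units of face size, so the same thresholds can be used whether the person is near to or far from the camera.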

Next, we compute the spectrum of each axis independently via fast 
Fourier transform (FFT) to obtain the frequency $W_{peak}$ and value $S_{peak}$
of the peak of the spectrum, the ratio between the first and the second
harmonic $S_{2nd}/S_{peak}$, and the signal energy $E$. We use a set of rules in
order to detect a sinc spectrum of the signal:

\[
\begin{cases}
A \sin(W t - \phi) & \text{if } t \in [0, M], \\
0 & \text{otherwise.}
\end{cases} \tag{4.3}
\]



We conclude that the person is shaking the head if the following condi- 
tions are satisfied: 



■ The $x$-axis shows a sinc-like spectrum:

$S_{peak}(x) > Th_1$ and $S_{2nd}(x)/S_{peak}(x) \le Th_2$.

■ The frequency of the first harmonic of the sinc-like spectrum is within
some predefined range:

$W_{peak} \in [W_{min}, W_{max}]$.

■ The $y$-axis shows no significant activity:

$E(y) < Th_3$.

Head nodding is detected by switching the x and y axes. In addition 
to these rules, changes in face size are used to determine whether the 
person is moving away from or toward the computer, and the average 
energy of the motion of the face is used as a measure of user activity. 
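
The rules above can be sketched in a few lines. The example uses a direct discrete Fourier transform from the standard library instead of an FFT (the result is the same for a short window), takes the second harmonic to be the spectrum value at twice the peak frequency, and measures frequency in Hz. These choices, the function names and all threshold values in the usage note are assumptions; the thresholds used by the authors were set by hand and are not given in the text.

```python
import cmath
import math


def spectrum_features(signal, frame_rate):
    """Return (peak frequency in Hz, peak value, second-harmonic value, energy)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    # Magnitudes of DFT bins 1..n/2; list index k-1 holds bin k.
    mags = [abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))) / n
            for k in range(1, n // 2 + 1)]
    k_peak = max(range(len(mags)), key=mags.__getitem__) + 1
    s_peak = mags[k_peak - 1]
    s_2nd = mags[2 * k_peak - 1] if 2 * k_peak <= len(mags) else 0.0
    energy = sum(v * v for v in centered) / n
    return k_peak * frame_rate / n, s_peak, s_2nd, energy


def is_shaking(xs, ys, frame_rate, th1, th2, th3, w_min, w_max):
    """Apply the three rules: sinusoid on x, frequency in range, quiet y."""
    w_peak, s_peak, s_2nd, _ = spectrum_features(xs, frame_rate)
    energy_y = spectrum_features(ys, frame_rate)[3]
    return (s_peak > th1 and s_2nd <= th2 * s_peak
            and w_min <= w_peak <= w_max and energy_y < th3)
```

As a usage example, a 2 s window at 30 frames per second containing a 2 Hz horizontal oscillation of normalized amplitude 0.2 and no vertical motion satisfies `is_shaking(xs, ys, 30.0, 0.05, 0.2, 0.01, 1.0, 5.0)`. Calling the same function with the two axes swapped tests for nodding.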

The threshold values for these rules were set by hand using examples
of head shaking and nodding of each person in our face video database.
A detailed description of this database is given in Chapter 7. Detection
rates of 96% and 87% with zero false detections were obtained for head 
nodding and shaking, respectively. Detection errors are mainly due to 
(i) shakes and nods that are only one half of a cycle, which the system 
fails to detect completely, and (ii) failures in face tracking due to a large 
rotation in depth while shaking the head. Aside from these limitations, 
the overall perceptual performance of the system when used in real time 
is excellent.




FACE AND FACIAL EXPRESSION
RECOGNITION 



In recent years, there has been a great deal of research on face recog- 
nition and facial expression recognition, but these two topics have been 
treated independently. The goal in most face recognition approaches 
is to find a similarity measure invariant to illumination changes, head 
pose and facial expressions so that images of faces can be successfully 
matched in spite of these variations. On the other hand, the goal of 
expression recognition is to find a model for nonrigid patterns of facial 
expression so that expressions can be classified. 

Although many techniques exist for nonrigid facial motion analysis 
from image sequences, the problem of face recognition has been ad- 
dressed dealing with still images only. Several face recognition tech- 
niques somewhat invariant to facial expression variations have been pro- 
posed, but only a handful of these actually use the structure of facial 
expression deformation, and none use deformation patterns themselves 
to help the recognition of faces. 

Motivated by these observations, we have studied and proposed a 
Bayesian recognition framework in which faces are modelled by individ- 
ual models of facial feature positions and appearances. Face recognition 
and facial expression recognition are carried out using maximum like- 
lihood decisions. The algorithm finds the model and facial expression 
that maximize the likelihood of a test image. In this framework, facial
appearance matching is improved by facial expression matching. Also, 
changes in facial features due to expressions are used together with facial 
deformation patterns to jointly perform expression recognition. 

We review previous research on face and facial expression recognition 
in Section 1. In Sections 2 and 3, we describe the proposed framework 
for embedded face and facial expression recognition from video.

1. Previous Approaches

Very little work has been done to combine local feature appearance 
and spatial information of features in visual pattern recognition. Only 
very recently, an object recognition algorithm was developed to do so 
[36]. In that technique, the local appearance of small sub-windows is first 
clustered and re-quantized in a principal component subspace. Then, a 
model of the conditional probability of the position given the appearance 
cluster is computed in a larger window. Combining all local appearance 
models, a probabilistic representation of the visual pattern is obtained. 
This technique has been tested in the context of face detection, yielding 
excellent results. 

In the following sections, we briefly review the techniques used for 
face and facial expression recognition. 

1.1 Face recognition 

The main approach taken for face recognition is to compare some data 
present in a database with that of the probe image obtained from the 
person to be recognized. Early work used the facial feature relative sizes 
and positions to perform the comparisons, while recent methods have
been more successful by comparing facial appearances (i.e., the
gray-level intensity). Several similarity measures and image preprocessing
techniques have been used to deal with image variations due to light
conditions, head pose, facial expressions, etc. Chellappa et al. [37] have
reviewed the research efforts on face recognition and its related issues. 

Techniques for face recognition that deal with image variations due 
to facial expressions are of particular interest in this book. Little work 
has been done to improve face recognition using knowledge about the fa- 
cial expression motion patterns. One of the best performing approaches 
uses elastic graphs (e.g., “dynamic link architecture” [38, 39]) for image 
comparison with nonrigid deformation. Image variations due to sub- 
tle differences in facial expressions and head position are overcome by 
this similarity measure. However, the image deformation pattern itself, 
which reflects the facial expression patterns and the geometry of the 
faces, is not used as part of this similarity measure. 

Another interesting approach uses a parametric model which is fitted
to face images [40, 41]. Faces are compared after the facial appearances
are normalized with respect to head position and facial expression. In
this technique, the face model is extremely simple and does not account
for all natural expressions. Moreover, no information about facial
expression patterns is used as part of the similarity measure for face
recognition.

The positions of the facial features are also used to estimate head 
poses and to normalize feature templates accordingly. This face recogni- 
tion approach [42] combines feature positions with feature appearances 
for similarity measurement, but it does not improve on the invariance 
to the facial expression deformations. In another approach for person 
identification based on the analysis of the spatio-temporal patterns of 
the lips [43], it has been found that combining shape information with 
intensity information improves recognition accuracy. 

1.2 Facial expression analysis 

Facial expressions have been widely studied from psychological points 
of view [44, 45]. Two of the most important results are that the face is 
the primary signal system for showing the emotions and that emotion 
understanding is crucial for interpersonal communications, relationships 
and successful interaction. At least seven distinct facial expressions are 
consistent across all races and cultures. Five expressions are associated
with emotional states: neutral, happiness, sadness, anger and fear. Other
facial expressions are surprise and disgust.

Although most research on facial expression is based on static images 
of the apex of expression, facial expressions have also been studied as 
complex spatio-temporal motion patterns [46]. Algorithms for facial 
expression analysis and recognition are expected to perform better from 
image sequences than from static pictures. 

Facial image analysis has been studied for many years, but only re- 
cently have computer vision systems been developed for the analysis of 
facial expressions. Early model-based implementations [47, 48, 49, 50] 
were not completely automatic and required the facial features to be 
highlighted with special make-up. Later techniques have overcome this
limitation [51, 52].

Most research on facial expression recognition is based only on non- 
rigid facial deformation patterns. Techniques used for motion analysis 
are optical flow [53, 54], 2D graphical models (“potential nets”) [55] and 
local parametric models [56, 57]. Appearance variations due to facial 
expressions that are not well described by motion fields are ignored. 

Most expression recognition algorithms were applied solely to the apex
of the facial deformation, discarding most of the spatio-temporal
information present in the transitions. Facial expression recognition
based on hidden Markov models (HMMs) [58, 59] has lately been proposed
to take advantage of the spatio-temporal character of the facial
expression patterns.

Figure 5.1. Scheme of facial features and regions 



2. Modelling Faces by Facial Features 

In this section, we describe the proposed approach for face and fa- 
cial expression recognition by modelling faces with the appearance and 
geometry of the facial features. Faces are modelled as a set of regions 
containing subsets of facial features. The appearance of each facial fea- 
ture is provided by the image sub-window located around its position, 
and the feature position is normalized with respect to the outer corners 
of the eyes. 

Figure 5.1 illustrates the four facial regions and the nine facial fea- 
tures used in our implementation. These face models and recognition 
algorithms are based on the assumption that the facial features can be 
accurately detected and tracked in image sequences. 

Consider the face recognition problem where $p \in \{1, 2, \ldots, P\}$ is the
index to the pth person in a database of P people and f is the portion of
the observed image used for face recognition. Face recognition is carried
out with maximum likelihood classification,

$$p^* = \arg\max_{p = 1, \ldots, P} P(f \mid p), \tag{5.1}$$

by selecting the model that maximizes the likelihood probability of the
observed image.

We use a hidden, discrete variable $e = 1, 2, \ldots, N$ to index the facial
expressions. The likelihood probability of the image f given the identity
class p is computed from

$$P(f \mid p) = \sum_{e=1}^{N} P(f \mid e, p)\, P(e \mid p). \tag{5.2}$$

Using this framework, the proposed embedded face and facial expression
recognition algorithm selects the person’s model and the facial expression
that maximize the likelihood of the test images. Then, in a
person-dependent scheme, the detected facial expression is simply that which
maximizes the likelihood for that person’s model.
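
As an illustration only, the decision rule in (5.1) and (5.2) can be sketched in a few lines of Python. The function names and the toy likelihood tables are hypothetical; in a real system the values P(f|e,p) would come from the region models described below.

```python
def identity_likelihood(frame_liks, expr_priors):
    """P(f|p) = sum over e of P(f|e,p) P(e|p), as in (5.2)."""
    return sum(l * w for l, w in zip(frame_liks, expr_priors))


def recognize(frame_liks, expr_priors):
    """Return (p*, e*) for one frame.

    frame_liks[p][e]  holds P(f|e,p) for person p and expression e.
    expr_priors[p][e] holds P(e|p).
    """
    persons = range(len(frame_liks))
    # Identity: maximize the marginal likelihood P(f|p), as in (5.1).
    best_p = max(
        persons,
        key=lambda p: identity_likelihood(frame_liks[p], expr_priors[p]),
    )
    # Expression: person-dependent, maximize P(f|e,p*) over e.
    liks = frame_liks[best_p]
    best_e = max(range(len(liks)), key=liks.__getitem__)
    return best_p, best_e


# Toy numbers (assumed): 2 people, 3 expressions.
toy_liks = [[0.02, 0.10, 0.01], [0.05, 0.02, 0.03]]
toy_priors = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
print(recognize(toy_liks, toy_priors))
```

In practice the products are usually accumulated as log-likelihoods to avoid numerical underflow.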

We model faces with a set of regions containing facial features. We
assume these facial feature regions $\{r_k;\ k = 1, 2, \ldots, R\}$ to be independent
for a given person and facial expression, and compute the likelihood
probability of the observed image from

$$P(f \mid e, p) = \prod_{k=1}^{R} P(r_k \mid e, p). \tag{5.3}$$

Finally, we compute the likelihood probability $P(r_k \mid e, p)$ of a region
based on the position and appearance of its facial features as

$$P(r_k \mid e, p) = P(v_{k1}, \ldots, v_{kF_k} \mid x_{k1}, \ldots, x_{kF_k}, e, p)\; P(x_{k1}, \ldots, x_{kF_k} \mid e, p), \tag{5.4}$$

where $x_{ki}$ and $v_{ki}$ denote the position and the appearance of the ith
of the $F_k$ facial features in region k.

Figure 5.2 illustrates schematically this probability network. Note 
that ovals represent hidden states, rounded corner rectangles represent 
data structures, and sharp-cornered rectangles represent actual observed 
data. 

We jointly model the class-conditional probability of the positions of
the facial features in a region, $P(x_1, \ldots, x_F \mid e, p)$, with a multidimensional
Gaussian distribution using a full-covariance matrix and mean vector
estimated from the examples. We model the appearance of each facial
feature in a region independently as in

$$P(v_1, \ldots, v_F \mid x_1, \ldots, x_F, e, p) = \prod_{i=1}^{F} P(v_i \mid x_i, e, p). \tag{5.5}$$
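
As a worked step that is not spelled out in the text, the region, position and appearance factorizations can be combined and written in logarithmic form, which is how such products are usually evaluated in practice. Here $x_{ki}$ and $v_{ki}$ are assumed to denote the position and appearance of feature i in region k, with $F_k$ features in that region and R regions in total:

```latex
\log P(f \mid e, p)
  = \sum_{k=1}^{R} \Bigl[
      \log P(x_{k1}, \ldots, x_{kF_k} \mid e, p)
      + \sum_{i=1}^{F_k} \log P(v_{ki} \mid x_{ki}, e, p)
    \Bigr]
```

Each region thus contributes one joint Gaussian term for its feature positions and one appearance term per feature.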

Figure 5.2. Probability network for embedded face and facial expression recognition 



2.1 Class-conditional probability of the feature images

We model the appearance of each facial feature with a multidimen- 
sional Gaussian distribution applied over the principal component sub- 
space with p dimensions. Note that this is different from the eigenfaces 
approach in which principal component analysis (PCA) is used to find 
the sub-space in which all object classes span the most. 

Let $x \in \mathbb{R}^d$ be a d-dimensional random vector with some distribution
that we are to model. We use a set of training samples of the class in
question to estimate the mean $\bar{x}$ and the covariance matrix $\Omega$ of the class.
Using singular value decomposition, we obtain the diagonal matrix S
corresponding to the p largest eigenvalues of $\Omega$ and the transformation
matrix T containing the corresponding eigenvectors. The conditional
probability of x for a given class is then computed from

$$P(x) = \frac{1}{\sqrt{(2\pi)^p \det(S)}} \exp\left(-\tfrac{1}{2}\, y^{\top} S^{-1} y\right), \tag{5.6}$$

where $y = T(x - \bar{x})$ is the projection of x onto the aforementioned
p-dimensional subspace.
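
The following Python sketch evaluates the logarithm of a Gaussian density in a principal component subspace, in the spirit of (5.6). It assumes that the mean, the projection matrix T (one eigenvector per row) and the p retained eigenvalues have already been estimated; the function name and the toy values are hypothetical.

```python
import math


def subspace_log_density(x, mean, T, eigvals):
    """Log of a Gaussian density evaluated in a p-dimensional PCA subspace.

    x, mean : sequences of length d
    T       : p rows of length d (eigenvectors of the class covariance)
    eigvals : the p largest eigenvalues (diagonal of S)
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # y = T (x - mean): projection onto the subspace.
    y = [sum(t * c for t, c in zip(row, centered)) for row in T]
    p = len(eigvals)
    # log of 1 / sqrt((2 pi)^p det(S)), with det(S) the product of eigenvalues.
    log_norm = -0.5 * (p * math.log(2 * math.pi) + sum(math.log(s) for s in eigvals))
    # y^T S^{-1} y for diagonal S.
    mahalanobis = sum(yi * yi / s for yi, s in zip(y, eigvals))
    return log_norm - 0.5 * mahalanobis


# Toy example (assumed values): d = 3, p = 2.
T_toy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(subspace_log_density([2.0, 0.0, 5.0], [0.0, 0.0, 0.0], T_toy, [4.0, 1.0]))
```

Note that the component of x outside the subspace (the third coordinate in the toy example) does not affect the result.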



3. Modelling Temporal Information of Facial Expressions

In Section 2, we proposed a maximum likelihood face and facial
expression recognition procedure to test image frames independently. For
a given video segment $V = \{f_1, f_2, \ldots, f_T\}$, the likelihood maximization
in (5.1) turns into

$$p^* = \arg\max_{p = 1, \ldots, P} P(V \mid p), \tag{5.7}$$

where the conditional probability of V given the identity p (the class)
is computed from

$$P(V \mid p) = \prod_{k=1}^{T} P(f_k \mid p). \tag{5.8}$$

Note that this collection of image frames does not need to be time- 
sequential; therefore, the temporal information about the facial expres- 
sions is discarded. 
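
A minimal Python sketch of this frame-independent rule follows. It works with log-likelihoods, and the names and toy values are hypothetical. It also makes the point above concrete: permuting the frames leaves the score unchanged.

```python
import math


def video_log_likelihood(frame_likelihoods):
    """log P(V|p) as the sum of the per-frame log P(f_k|p)."""
    return sum(math.log(l) for l in frame_likelihoods)


def recognize_from_video(per_person_frames):
    """Return the index p* that maximizes log P(V|p)."""
    scores = [video_log_likelihood(frames) for frames in per_person_frames]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy per-frame likelihoods P(f_k|p) for two people (assumed values).
toy_video = [[0.2, 0.1, 0.3], [0.1, 0.1, 0.1]]
print(recognize_from_video(toy_video))
```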

One technique for overcoming this limitation, based on HMMs, is to
replace the hidden variable e by a hidden state e(t) so that the probability
of a facial expression at time t is obtained from

$$P[e(t) \mid p] = P[e(t) \mid e(t-1), p]\; P[e(t-1) \mid p], \tag{5.9}$$

where $P[e(t) \mid e(t-1), p]$ is the state transition matrix of the HMM for
person p.

Using this HMM in a video segment, i.e., a time-sequential collection
of images, the likelihood probability of the video at time t,
$V(t) = \{f_1, f_2, \ldots, f_t\}$, for a given person p, is computed recursively from

$$P[V(t) \mid p] = P[V(t-1) \mid p]\; P[f(t) \mid e(t), p]\; P[e(t) \mid e(t-1), p], \tag{5.10}$$

so that the recognition procedure in (5.7) does take into account the tem- 
poral information of the facial expressions. In this framework, face and 
facial expression recognition from video segments can also be carried out 
using either the Viterbi algorithm or the forward-backward algorithm. 
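
As a sketch of how such a recursion is commonly evaluated, the following Python function implements the standard HMM forward algorithm for one person's model. Unlike the compact notation above, it makes the sum over the previous expression state explicit. The function name, the initial state distribution and the toy numbers are assumptions made for this example.

```python
def forward_likelihood(obs_liks, trans, init):
    """P[V(T)|p] computed with the HMM forward recursion.

    obs_liks[t][e] : P(f(t) | e(t) = e, p) for frame t
    trans[i][j]    : P(e(t) = j | e(t-1) = i, p)
    init[e]        : P(e(1) = e | p)
    """
    n = len(init)
    # alpha[e] = P(f(1), ..., f(t), e(t) = e | p)
    alpha = [init[e] * obs_liks[0][e] for e in range(n)]
    for frame in obs_liks[1:]:
        alpha = [
            frame[j] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)


# Toy model (assumed): two expression states, two frames.
toy_init = [1.0, 0.0]
toy_trans = [[0.5, 0.5], [0.0, 1.0]]
toy_obs = [[0.2, 0.1], [0.3, 0.4]]
print(forward_likelihood(toy_obs, toy_trans, toy_init))
```

For long sequences the alpha values shrink quickly, so implementations typically rescale them at each step or work in the log domain.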




Chapter 6 



3-D MODEL-BASED IMAGE COMMUNICATION



1. Introduction 

The following chapter will be organized into 4 main sections. The first 
will begin with a general introduction to the model-based paradigm and 
the different frameworks within which it can be used. The next 2 sections 
will deal with 3 of the main steps in the model-based approach: analysis, 
synthesis, and modelling. Finally, we will conclude with a description of 
a real-time model-based coding system being developed at the Beckman 
Institute and present some encouraging results. Throughout the chapter 
we will discuss some of the previous approaches to the different problems 
in model-based coding and then present some of the current work being 
done by the authors. As in most model-based coding research, we
concentrate our efforts on head-and-shoulders images. To that end, we
consider issues such as rigid and non-rigid tracking, object modelling,
and computer graphics synthesis in the context of human heads and
faces. This is not to say that the model-based approach is of no use in
other scenarios, since it can be extended to deal with many types of
objects, both synthetic and real.

1.1 The Model-Based Approach 

Extracting information from a scene is a difficult task that is often
ill-posed, impractical or unrealizable. Extra information or
constraints are usually required, but closed-form solutions are rarely
available. In most cases, it is easier to model the scene and adjust the
parameters of such a model until it matches the observations. Those
parameters then provide the desired information about the observed scene.
Additionally, such a set of parameters is expected to provide a compact
representation of the non-redundant information of the scene and
therefore provides the key to advanced, more efficient and flexible
coding techniques. This is the basic idea of model-based approaches in
Computer Vision and Visual Communications.

1.2 Model-Based Analysis 

Pattern Recognition and Image Understanding techniques can be
improved if knowledge (in the form of models) is made available to them. Figure 6.1
shows the overall scheme of a model-based analysis system. The input 
image is pre-processed, possibly with the help of the knowledge provided 
by the model, to obtain the raw information that is matched with the 
model output. The model is then tuned iteratively until the modelling 
error is minimized. The parameters that produce the best match with 
the observation are expected to accurately describe the ground-truth of 
the observed scene. 
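
The tuning loop described above can be sketched, under strong simplifications, as a derivative-free search that adjusts the model parameters until the squared modelling error stops decreasing. The search strategy (a coordinate pattern search), the function names and the toy line model are all assumptions made for this example and are not the method used in the system discussed later.

```python
def fit_model(observation, synthesize, params, step=1.0, tol=1e-6):
    """Tune `params` so that synthesize(params) matches `observation`.

    Returns the tuned parameters and the final squared modelling error.
    """
    def error(p):
        return sum((o - s) ** 2 for o, s in zip(observation, synthesize(p)))

    params = list(params)
    best = error(params)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params[:]
                trial[i] += delta
                e = error(trial)
                if e < best:  # keep any change that reduces the error
                    params, best, improved = trial, e, True
        if not improved:
            step *= 0.5  # refine the search around the current estimate
    return params, best


# Toy "scene": samples of the line 2 t - 1; toy model: a t + b.
observed = [2.0 * t - 1.0 for t in range(5)]
line_model = lambda p: [p[0] * t + p[1] for t in range(5)]
fitted, err = fit_model(observed, line_model, [0.0, 0.0])
print(fitted, err)
```

In a real analysis system the synthesis step would render the model (for example, feature templates at a given head pose) and the error would be measured against pre-processed image data.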




Figure 6.1. Model-Based Analysis: Information is extracted from the scene by tuning
the parameters until the modelling error is minimized. (Diagram labels: Input Image,
Model Space, Model Parameters.)



One example of a successful implementation of an analysis system
driven by a model is reported in [65], [29]. In this system, the global
head pose and the individual feature positions are tracked over long video
sequences using a three-dimensional model of the head; the feature
positions are located with the help of templates that are synthesized using the
model at an approximate head pose. This implementation is discussed
in more detail in a later section of this chapter.

Expression recognition can be improved by using motion parameters
obtained via model-based analysis. Such motion parameters are
normalized with respect to the global head pose and therefore relax the
restricted scenarios in which current applications succeed. In addition,
a parametric description of the facial motion obtained with model-based
approaches might describe expressions better, since it could be closely
related to the actual physical system that causes such gesture patterns.

Figure 6.2. Model-Based Coding: model parameters are encoded and sent through
the channel; the receiver then synthesizes the output. (Diagram labels: Input Video,
Model Parameters, Output Video.)



1.3 Model-Based Coding 

In the past decade or so, research in the application of the model- 
based approach to image and video coding has been strong. The main 
motivation has been the drastic reduction in bandwidth for transmission 
as well as obtaining a higher level of representation for video streams. 
In this context, we assume both the transmitting and receiving parties 
have knowledge of the 3D model. By analyzing the input video stream, 
high level information regarding the activity in the scene is extracted 
based on the models available. This information is sent to the receiving
end, where it is applied to the local model (Figure 6.2). There are several
advantages to this approach for coding video:

■ Model parameters provide a compact representation of the geometry 
of the scene as well as its motion. Most of the underlying redundancy 
is covered with the global object position and its articulated local 
motion. 

■ Model-based coding allows operations such as rendering synthetic
images from different viewpoints, in virtual environments, etc.

1.4 Virtual Agent 

Model-based approaches find an important application in Human-Computer
Interface research, where the ultimate goal is to have the computer
behave like an agent (Figure 6.3). This virtual agent should not only
recognize and understand the user, but also look, move, and respond
accordingly.

Modelling information should include realistic gesture and motion. 
Off-line analysis of human behavior should be carried out to capture 
the motion patterns that the virtual agents will adopt, for example, 
gestures, lip motion, etc. On-line analysis would provide the system 
with the ability to perceive the user. 

Figure 6.3. Virtual Agent: A fully automatic, synthetic agent that allows natural
interaction between humans and computers



2. Modelling and Analysis 

Modelling and animation of human faces has been an important
research issue with many practical applications in the fields of computer
animation, model-based video compression, and human-computer inter- 
action. The objective is to generate photo-realistic facial images. To 
achieve this goal, several issues including the geometric face model, the 
face articulation model, and the synthesis techniques have to be exten- 
sively investigated. 

A geometric model includes the geometric surface mesh, the texture 
information, and the rendering environment description. The deforma- 
tion of this model is described by a facial articulation model, which is 
usually consistent with the physical rules to make the rendered face im- 
ages visually convincing. Depending on the nature of an application, 
many techniques are applicable to the human face modelling process. 
One approach to simplify the procedure is to divide a model into four 
layers and build these layers one upon another [66]. The bottom layer 
is the geometric model. The next three layers are the parameter layer, 
the expression and viseme layer, and the script level. Each higher level 
is an abstraction of the lower levels. 

2.1 Geometric Face Modelling 

The surface representation plays an important role in a geometric face 
model. There are many ways of describing a 3D surface. Polygonal mesh
surface models, free-form parametric surface models and implicit
geometric surface models are most commonly used. It is still very
challenging to model detailed facial features such as wrinkles and marks
using these methods. Texture mapping is considered to be a solution to
this problem. However, because of technical limitations, texture
mapping was not applied in most early face models. In recent years,
graphics workstations have become capable of rendering texture-mapped
surfaces in real time, and this solution has become more appealing.

Besides the geometric model, an environment model, which describes 
the relationship between the facial model and its environment variables 
such as the lighting sources and the background description, should also 
be carefully considered to achieve realistic visual effects in a video com- 
pression system. 

2.1.1 3D Surface Modelling 

Most of the early human face surface models are composed of irregular
triangular meshes. Precise approximation is achieved when the shape is
sampled sufficiently densely. The advantage of this configuration lies in
the fact that triangular meshes are very flexible in modelling complicated
objects. Also, they can be rendered efficiently.

For describing irregular shapes such as a human face, a free-form
surface model is preferred to a solid surface model. The two main types of
free-form surfaces are parametric surfaces and implicit surfaces.
Parametric surfaces such as Bezier surfaces and B-spline surfaces have been
investigated intensively in the areas of approximation theory and
computer-aided geometric design for many years [67] [68]. Unlike polygonal
surface models, which only achieve zeroth-order continuity ($C^0$), this approach
is capable of constructing surfaces with a higher order of continuity. A
drawback of this method, however, is that it is computationally
expensive. Fortunately, this may not be a problem in the future.

A straightforward approach to improving the face surface model is to
derive a free-form model directly from an existing polygonal model by
applying interpolation techniques. Algorithms are available for
interpolating both triangular and rectangular meshes. At first glance, the
triangular mesh interpolation approach is very attractive. However, after
careful analysis and experiments, we found that many problems in triangular
patch interpolation are not yet solved [69] [70] [71] [72]. One
problem is the choice of normal vectors at the vertices. They are important
parameters that affect the shapes significantly. So far, no
algorithm guarantees optimal solutions in all situations. Another problem
is the computational cost. Triangular patch interpolation is much
slower than rectangular mesh interpolation. The reason is that, for a
rectangular mesh, a 2D interpolation process is decomposed into two
1D interpolation processes. A disadvantage of the rectangular mesh is its
lack of flexibility in modelling complicated shapes. To compensate for this
drawback, hierarchical rectangular meshes are adopted in our system.
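
The decomposition of a 2D process into two 1D processes can be illustrated with a uniform bi-cubic B-spline patch, evaluated first along the rows and then along the resulting column. This is only a sketch: the function names are hypothetical, the 4 x 4 control grid holds scalar values (for example one coordinate of the surface), and the code evaluates a patch from given control points rather than solving for control points that interpolate measured samples.

```python
def cubic_bspline_1d(c0, c1, c2, c3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return b0 * c0 + b1 * c1 + b2 * c2 + b3 * c3


def bicubic_patch(ctrl, u, v):
    """Evaluate a 4 x 4 tensor-product patch with two 1D passes."""
    # First pass: evaluate every row of control values at parameter u.
    row_values = [cubic_bspline_1d(*row, u) for row in ctrl]
    # Second pass: evaluate the four row results at parameter v.
    return cubic_bspline_1d(*row_values, v)


# Control values that increase linearly along each row (assumed data).
ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
print(bicubic_patch(ramp, 0.25, 0.7))
```

Because the basis functions sum to one and reproduce linear data, the ramp above evaluates to 1 + u regardless of v.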

2.1.2 Generic Face Model from MRI Data 

To build a generic face model, the first step is to acquire the positions
of sample points on the surface of a real human face. Several techniques
are available to accomplish this task. Using Cyberware 3D color scanner
data is a handy approach. However, the problem is that the internal
structures of a human face, such as the bone structure and the muscles,
are not observable. An alternative method is to analyze magnetic
resonance imaging (MRI) data, which give us information about both
the surface and the inner structures [73]. The process of modelling the face
surface using MRI data includes the following three steps:

1 Contour fitting in each MRI data slice of interest

In this stage, a fixed number (25 points in our model) of sample points
on the surface contour in each MRI data slice of interest are manually
extracted. By assuming that a face is symmetric, only 13 points are
needed in each slice. These sample points are adjusted so that the
B-spline curve interpolated from these points fits the contour. 41
data slices are sampled in our system for the face surface.

2 2D interpolation

From step 1, a 25 × 41 rectangular mesh is obtained. The bi-cubic
B-spline interpolation scheme is applied to this mesh to calculate a
bi-cubic B-spline surface model.

3 Refinement of the regions of interest

This step is accomplished by repeating step 1 and step 2 on local
features such as the nose and ears. For example, in our model, the nose
is refined to a 10 × 8 mesh; the ears are refined to 12 × 15 meshes.
A picture of the derived generic face model is shown in Figure 6.4(a).
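
The point counts in step 1 can be checked with a short sketch: if the 13 extracted points run from the symmetry plane outward, mirroring the 12 off-plane points yields the 25 points per slice, and 41 slices give a 25 × 41 mesh. The coordinate convention (symmetry plane at x = 0, first point lying on it) and the function name are assumptions made for this example.

```python
def mirror_contour(half):
    """Build a full symmetric contour from its half.

    half : list of (x, y) points, ordered from the symmetry plane
           (first point, x = 0) outward.
    """
    # Mirror every point except the one on the symmetry plane.
    mirrored = [(-x, y) for x, y in reversed(half[1:])]
    return mirrored + half


# Toy half-contour with 13 points (assumed shape).
half_contour = [(float(i), float(i * i)) for i in range(13)]
full_contour = mirror_contour(half_contour)
mesh_vertices = len(full_contour) * 41  # 41 slices
print(len(full_contour), mesh_vertices)
```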

Figure 6.4. (a) A bi-cubic B-spline facial surface model, (b) Texture mapped version
of a fitted model.

2.1.3 Face Geometric Model Fitting 

After a generic geometric facial model has been derived from a
particular data set, the next step is to fit the model to a specific person. This
process is also called model fitting. The model fitting problem has been
investigated in various research fields such as scattered data interpolation
[74], free-form surface model deformation [75] [76], and elastic object
deformation [77]. The common goal of these processes is to transform
shapes in 3D smoothly. In the facial model fitting case, an additional
constraint is imposed: the resulting model should be consistent with the
geometry of a real human face.

Problem Formulation. The fitting problem is generally stated as
follows: given 3D surface models A, B and some points $p_i$, $i = 0, 1, \ldots, n-1$,
on surface A and their corresponding points $q_i$, $i = 0, 1, \ldots, n-1$, on
surface B, find a $C^1$ continuous mapping function $F : A \to B$ that satisfies
$F(p_i) = q_i$ and for which F(A) is a reasonable facial surface model.

The above statement also describes a 3D scattered data interpolation
problem. The special case in which all $p_i$ and $q_i$ are coplanar has
been intensively investigated for image warping. Most image warping
algorithms [78] [79] can be extended conveniently to 3D model fitting.

Image Warping Methods. To understand the image warping process,
imagine a rubber sheet with an image printed on it. Image warping has the same
effect on the image as stretching the rubber sheet at points of interest.
Given the 2D displacements at some image feature points,
an appropriate displacement interpolation scheme is necessary to
derive the displacements of all other image points. Several algorithms using
this interpretation have been proposed. The three major categories are
triangulation-based methods, inverse distance methods, and radial
basis function methods [80]. In triangulation-based methods, a Delaunay
triangulation is first constructed from the image feature points; then
polynomial interpolation of the displacement values is performed in the image space
to find the warping function. Recently, it has been proved that
this approach is optimal in terms of minimizing roughness, although the
triangulation process involves no knowledge of the displacement values
[81]. In inverse distance methods and radial basis function methods, the basic
idea is to interpolate the displacement values using weighted averages.
The weights are derived from inverse distances or radial function values.
The farther an image point is from a feature point, the less it is
affected. A common problem of these methods is that they treat all
feature points equally. When the warping problem is a multilevel mapping
by nature, as in human expression synthesis, they often fail to produce
the expected results. To overcome this problem, the locally bounded radial basis
method was introduced. The key idea is to define an effective range for
each radial function. This additional degree of freedom produces more
pleasing results if the feature points and function bounds are carefully
chosen. Ruprecht [80] wrote an excellent survey of these image warping
algorithms.
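
As a minimal sketch of the inverse distance idea, the following Python function interpolates the displacement of an arbitrary image point as a weighted average of the feature point displacements. The function name, the weight exponent and the optional hard cut-off radius are assumptions made for this example; the cut-off only loosely imitates a bounded effective range and, unlike a properly designed radial function, is not smooth at the boundary.

```python
import math


def idw_displacement(point, features, displacements, power=2.0, radius=None):
    """Inverse-distance-weighted displacement at `point`.

    features      : list of (x, y) feature point positions
    displacements : list of (dx, dy), one per feature point
    radius        : if given, features at this distance or farther are ignored
    """
    px, py = point
    num_x = num_y = denom = 0.0
    for (fx, fy), (dx, dy) in zip(features, displacements):
        dist = math.hypot(px - fx, py - fy)
        if dist == 0.0:
            return (dx, dy)  # exactly on a feature point: interpolate it
        if radius is not None and dist >= radius:
            continue
        weight = 1.0 / dist**power
        num_x += weight * dx
        num_y += weight * dy
        denom += weight
    if denom == 0.0:
        return (0.0, 0.0)  # no feature point within range
    return (num_x / denom, num_y / denom)


# Two feature points (assumed): the first moves right, the second is fixed.
toy_features = [(0.0, 0.0), (2.0, 0.0)]
toy_displacements = [(1.0, 0.0), (0.0, 0.0)]
print(idw_displacement((1.0, 0.0), toy_features, toy_displacements))
```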



Face Model Fitting Using a Weighted Voronoi Diagram. In our
approach, each feature point is assigned a weight according to its
influence. A weighted Voronoi diagram is constructed based on this
information. Then, interpolation methods are applied to this Voronoi diagram.
In this process, only the feature points with the largest weights generate
Voronoi cells. This also means that only those points with large weights
are mapped to their final positions. Once the feature points with large
weights have been mapped correctly, their weights are reduced to the
maximum weight of the unmapped feature points and their
displacement values are set to 0. This procedure is performed iteratively until
all feature points are correctly mapped. The advantage of this approach
is that the underlying triangulation changes between iterations
to fit the displacement scale. Both global rigid motions and local
non-rigid deformations are modelled appropriately using this approach. More
details are given in [82]. A fitted face model is shown in Figure 6.4(b).

2.1.4 Face Geometric Model Compression 

For quadrilateral meshes, scalable compression schemes are easy to
derive [83]. Here, the term scalable means that a layered coding
scheme can generate object-oriented wire-frames of different resolutions
from a single source. As a result, a face model has multiple
representations. When the model is rendered on a high-performance platform,
a high-resolution version is used. Otherwise, a simplified version of the
same model is adopted. Working in concert with existing texture coding
standards, this type of wire-frame compression technique provides
scalability to many computer facial animation systems. The
proposed wire-frame compression scheme consists of the following three
steps:

1 Coordinate system transformation 

The first step is to transform the wire-frame data to a cylindrical
coordinate system. The three resulting coordinates are radius, angle, and
height (r, θ, h). For a single rectangular mesh, each of them is
represented as a matrix. The transformed coordinates are smoother than
the original data and are more efficiently encoded (Figure 6.5(a)).

2 Intra-mode and inter-mode coding 

The wire-frame data are coded in either intra-mode or inter-mode. In 
the intra-mode, the entire wire-frame structure is coded into multiple 
layers of bit-streams and transmitted to the decoder. In the inter- 
mode, only the prediction errors of the wire-frames are transmitted. 
Both coding modes are used for the purpose of down-loading a new 
face model. However, the inter-mode exploits the predictive coding 
further when the decoder and the encoder have the same base surface 
model. 

3 Pyramid progressive coding 

The resulting wire-frame data are progressively coded, as shown in
Figure 6.5(b). The down-sampling operation computes the average
position of 4 neighboring points, which is then quantized and forms
the next layer of data (lower spatial resolution) to be coded. The
residual errors between the original data and the quantized average
data are compressed using an entropy coding scheme and transmitted to
the decoder. The advantage of this scheme is that the computations
are simple, local, and can be performed in parallel.
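
The coordinate transformation and one level of the pyramid can be sketched as follows. This is a simplified illustration under stated assumptions: the vertical head axis is taken to be y, each layer is a matrix of one coordinate (here the radius) with even dimensions, quantization is uniform with a hypothetical step size, and the entropy coding of the residuals is omitted. All function names are made up for this example.

```python
import math


def to_cylindrical(x, y, z):
    """Map a vertex to (r, theta, h), assuming the vertical axis is y."""
    return math.hypot(x, z), math.atan2(x, z), y


def downsample(layer, step):
    """Average each 2 x 2 block and quantize to a multiple of `step`."""
    return [
        [
            round((layer[i][j] + layer[i][j + 1]
                   + layer[i + 1][j] + layer[i + 1][j + 1]) / 4 / step) * step
            for j in range(0, len(layer[0]) - 1, 2)
        ]
        for i in range(0, len(layer) - 1, 2)
    ]


def residual(layer, coarse):
    """Difference between a layer and its quantized coarse prediction."""
    return [[layer[i][j] - coarse[i // 2][j // 2] for j in range(len(layer[0]))]
            for i in range(len(layer))]


def reconstruct(coarse, res):
    """Decoder side: coarse prediction plus transmitted residuals."""
    return [[coarse[i // 2][j // 2] + res[i][j] for j in range(len(res[0]))]
            for i in range(len(res))]


# Toy 4 x 4 matrix of radii (assumed values).
radii = [[10.0, 10.5, 11.0, 11.5],
         [10.0, 10.5, 11.0, 11.5],
         [9.0, 9.5, 10.0, 10.5],
         [9.0, 9.5, 10.0, 10.5]]
coarse_layer = downsample(radii, 0.5)
residuals = residual(radii, coarse_layer)
print(coarse_layer, reconstruct(coarse_layer, residuals) == radii)
```

Every output value depends only on a 2 × 2 neighborhood, which is why the computation is local and easy to parallelize. Applying `downsample` repeatedly to the coarse layer yields the remaining pyramid levels.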

Figure 6.5. (a) Block diagram of the wire-frame compression scheme, (b) Pyramid
progressive coding method.



2.2 Facial Articulation Modelling 

2.2.1 Articulation Parameters 

Where the animation of the facial model is concerned, using an
accurate facial articulation model is crucial. One of the early works on
understanding facial motions is the facial action coding system (FACS)
developed by Ekman et al. [84]. FACS defines a minimal set of
facial deformations for driving a face model.

Recently, some physically based models have been developed [85] [49].
In these models, the skin and the tissues are modelled as elastic
materials. Muscle activities are considered to be the stimulation of this
mechanical system. The state of this system with minimum energy is
computed using the finite element method. Though these models are
physically close to the real facial motion model, they often fail to generate
realistic results. The reason is that the real facial muscle system is very
complicated and most existing systems are only very rough
approximations.

Another type of dynamic facial model describes the facial
articulations in terms of surface deformations. An example is the facial animation
parameter set (FAP) defined by the MPEG-4 synthetic and natural hybrid
coding (SNHC) sub-group [86]. Defining this set of parameters enables
communication between distributed applications such as talking heads,
teleconferencing, and intelligent human agents.

There are 68 parameters in FAP in total. 66 of them are articulation
parameters, or more precisely, the position offsets of facial feature points
relative to their neutral positions. For example, open_jaw depicts the
downward movement of the jaw. Most FAP parameters can be obtained
by computer vision techniques such as point tracking. In our
system, a FAP stream is used to articulate the face model.

2.2.2 Model Deformation 

Based on FACS, some geometric articulation models have been
developed [87] [88]. In these models, the geometric deformations of a
single muscle action or a group of muscle actions are described. The rendered results
are the combination of these deformations. Thalmann et al. [87]
proposed a facial articulation model called abstract muscle action procedures
(AMAP), in which the facial movements are described by the
displacements of the surface vertices. These displacements are implemented in
many procedures, each of which is defined individually. This approach
is similar to a performance-driven model in the sense that both
describe the deformations in purely geometric terms. However, the AMAP
is relatively tedious to develop. To overcome this problem, Thalmann et
al. [88] developed a free-form surface deformation system to emulate the
facial motions. To some extent, this method simplifies the AMAP model.
However, choosing the bounding boxes of the deformation parallelepipeds
is still a time-consuming task.

Purely geometric approaches are usually not successful in modelling
the non-linearity in complex articulations. Physical models were introduced
to handle these situations. Waters et al. [85] [49] developed a three-layer
muscle model. In this model, three connected spring layers are
configured to model the outermost facial skin, the intermediate layer of soft tissues,
and the underlying fixed bone structure. Muscles are modelled as
elastic line segments with one end attached to the bone layer and the other
end attached to the tissue layer. Without any muscle action, the
facial expression stays neutral. When the contraction parameters of some
muscles are provided, the dynamic system is no longer in equilibrium. Then, the
finite element method (Euler method) is applied to find the minimum
energy state, which is the final facial expression. Platt and Badler [89]
also developed a similar system for a high-resolution face surface model.
A common problem of physical models is that the user has no control
over the final results once the physical parameters are set. Usually, it is
difficult to foresee the animation results from these data.

In some situations, since the goal is to make the synthesized face
visually identical to a given image or video sequence, it is
inefficient to articulate a complex model with the underlying structures
of a face. With this motivation, some so-called performance-driven
facial articulation models were developed [90] [91]. The kernels of
these models are usually tracking algorithms. Markers are put on the
face to make the tracking process easier. Texture mapping techniques are
also exploited to create photo-realistic face images [92]. In our
implementation, since FAP parameters are employed, this type of
articulation model is particularly appropriate. Based on this approach,
we also integrated a physical muscle model and a purely geometric
articulation model into our system. Figure 6.6 shows a synthesized
expression “surprise”.




Figure 6.6. Synthesized facial expression “surprise”

2.2.3 Articulation Parameter Stream Compression

In our scheme, principal component analysis (PCA) is applied to exploit
the correlations among FAP parameters and to represent them in an
optimal coordinate system in which the energy of the original signal is
concentrated in a smaller sub-space. In the new representation, the
different components are uncorrelated.

Suppose that for each time instance, or frame, the original FAP
parameter set is a 68-dimensional vector $v_i$, and that the ensemble
average of the $v_i$ is $\bar{v}$. Then the covariance matrix is

$$C = \frac{1}{k} \sum_{i=1}^{k} (v_i - \bar{v})(v_i - \bar{v})^T,$$

where $k$ is the number of FAP vectors. The eigenvectors of $C$ are
orthogonal to each other and span a new coordinate system. The
eigenvalues of $C$ indicate the energy distribution over each
coordinate.

The most significant eigenvalues are then extracted, and a sub-space is
formed by the eigenvectors corresponding to these eigenvalues. This
sub-space contains most of the original signal energy. In other words,
the projection of an original FAP vector onto this sub-space is a good
approximation of that vector. For the MPEG-4 test sequences that were
investigated, the mean square error was observed to be less than 2% of
the original signal energy when the 8 most significant eigenvectors are
used. As a result, the dimension of the representation is reduced
dramatically, from 68 to 8. This projection process is also known as the
Karhunen-Loeve transform (KLT). After appropriate quantization, these 8
components are differentially encoded. A block diagram of this method
is shown in Figure 6.7.

The source of compression in the PCA scheme is the correlation among
FAP parameters. This correlation reflects the fact that each expression
involves many physically related muscle movements. Principal component
analysis is a powerful tool for taking advantage of this property.
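
As an illustration of the scheme above, the following minimal sketch
computes a KLT basis from a set of FAP vectors and projects them onto
the 8 most significant eigenvectors. It assumes NumPy is available. The
function names (fit_klt, klt_encode, klt_decode) and the synthetic FAP
stream are hypothetical and only stand in for real MPEG-4 test data;
quantization and differential encoding are omitted.

```python
import numpy as np

def fit_klt(fap_vectors, num_components=8):
    """Return the ensemble mean and the leading eigenvectors of C."""
    v = np.asarray(fap_vectors, dtype=float)       # shape (k, 68)
    mean = v.mean(axis=0)
    centered = v - mean
    cov = centered.T @ centered / len(v)           # C = 1/k sum (v_i - mean)(v_i - mean)^T
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_components]
    return mean, eigvecs[:, order]                 # basis has shape (68, num_components)

def klt_encode(v, mean, basis):
    """Project FAP vectors onto the sub-space."""
    return (np.asarray(v, dtype=float) - mean) @ basis

def klt_decode(coeffs, mean, basis):
    """Approximate the original FAP vectors from the sub-space coefficients."""
    return coeffs @ basis.T + mean

# Synthetic stand-in for a FAP stream: 8 hidden factors drive 68 parameters.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 68))
faps = latent @ mixing + 0.01 * rng.normal(size=(500, 68))

mean, basis = fit_klt(faps, num_components=8)
coeffs = klt_encode(faps, mean, basis)
restored = klt_decode(coeffs, mean, basis)
relative_error = np.sum((faps - restored) ** 2) / np.sum((faps - mean) ** 2)
```

Here relative_error is the fraction of the signal energy lost by keeping
8 components, which is the quantity quoted above for the MPEG-4 test
sequences.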

2.3 Synthesis 

Real-time applications demand fast rendering of a realistic articulated
face model. Several issues, including the environment model and texture
mapping techniques, have to be considered carefully.

2.3.1 Environment Model 

Figure 6.7. Block diagram of a FAP compression scheme

To get realistic animation results, environment variables such as the
lighting sources and the background have to be considered. If the
positions and properties of the lighting sources are assumed to be
known, the following procedure incorporates the texture mapping process
and the lighting model into the face model synthesizer. First, the
facial surface model is rendered without texture mapping. Then, the
texture is attached to the surface using a blending technique. The
simplest blending function is a linear interpolation, which is written
as

$$c = \alpha c_t + (1 - \alpha) c_i, \quad \alpha \in [0, 1],$$

where $c_i$ is the color of the surface with no texture attached to it,
$c_t$ is the color of the texture, and $\alpha$ is the transparency
property of the texture. This process is performed automatically on
some SGI workstations.
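
The blending function above amounts to one line of arithmetic per color
channel. A minimal sketch, assuming colors are RGB tuples with
components in [0, 1] (the function name blend and the sample colors are
hypothetical):

```python
def blend(texture_color, surface_color, alpha):
    """Linear blend c = alpha * c_t + (1 - alpha) * c_i, per color channel."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * ct + (1.0 - alpha) * ci
                 for ct, ci in zip(texture_color, surface_color))

skin_texture = (0.80, 0.60, 0.50)   # c_t: RGB sample from the face texture
lit_surface = (0.40, 0.40, 0.40)    # c_i: shaded surface without texture
blended = blend(skin_texture, lit_surface, alpha=0.75)
```

With alpha = 1 the texture fully replaces the shaded surface color, and
with alpha = 0 the texture is invisible.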

A more complicated problem is combining a face model with its
background. One possible solution for generating a simple background
is to produce models for all background objects such as walls and
windows. An even simpler approach is to put a background image behind
the 3D face model. If the background is not complicated, the view
projection is stable, and the facial model is always in front of the
background objects, this approach often gives satisfactory results.

2.3.2 Texture Mapping 

Texture mapping is applied to generate the fine structures of a human
face, such as wrinkles and scars. With the help of dedicated graphics
chips, workstations are able to perform texture mapping of
high-resolution images in real time. The key issue in texture mapping
is to establish a one-to-one correspondence between the face image
(texture) and the 3D model surface. When the face surface model is
represented by planar polygons, triangles for example, the mapping
process is straightforward. First, each triangle in the 3D model is
projected onto the 2D face image, which produces a 2D triangle. Then
the texture in the 2D triangle is mapped to the 3D triangle surface. In
practice, only the 2D coordinates of the vertices need to be fed to the
graphics synthesizer; the texture mapping function is then generated
trivially.

For the B-spline surface model discussed in previous sections, the
problem is how to find the area of a face image that corresponds to a
bi-cubic Bezier patch. A property of a 3D Bezier curve is that its
projection onto any 2D plane is a 2D Bezier curve. As a result, the
area of the face image plane that corresponds to a 3D Bezier patch is
also a Bezier patch. From this conclusion, a texture mapping scheme is
easily derived. First, all control points of a Bezier patch are
projected onto the image plane. Then, the mapping function is stated as

$$C(s,t) = I(s,t), \quad s, t \in [0,1],$$

where $C(s,t)$ is the 3D Bezier patch and $I(s,t)$ is the Bezier patch
in the face image plane.
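
The projection property can be checked numerically. The sketch below
assumes an orthographic projection onto the image plane (dropping the
depth coordinate), for which the property holds exactly, and compares
the projection of a point on a 3D bi-cubic Bezier patch with the 2D
Bezier patch built from the projected control points. The control net
and the function names are hypothetical.

```python
from math import comb

def bernstein(n, i, u):
    """Bernstein polynomial B_{i,n}(u)."""
    return comb(n, i) * u**i * (1.0 - u)**(n - i)

def bezier_patch(control, s, t):
    """Evaluate a bi-cubic Bezier patch; control is a 4x4 grid of points."""
    dim = len(control[0][0])
    point = [0.0] * dim
    for i in range(4):
        for j in range(4):
            w = bernstein(3, i, s) * bernstein(3, j, t)
            for d in range(dim):
                point[d] += w * control[i][j][d]
    return tuple(point)

def project(point3d):
    """Orthographic projection onto the image plane: drop the depth z."""
    return point3d[:2]

# Hypothetical 4x4 control net of a curved 3D patch.
control3d = [[(float(i), float(j), 0.3 * i * j - 0.1 * (i - j) ** 2)
              for j in range(4)] for i in range(4)]
control2d = [[project(p) for p in row] for row in control3d]

s, t = 0.3, 0.7
projected_surface_point = project(bezier_patch(control3d, s, t))
image_patch_point = bezier_patch(control2d, s, t)
```

The two points agree up to rounding, so the texture coordinate of any
patch point (s, t) can be read from the 2D patch directly.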

A problem with the above method is that, when a single face image is
used (for example, only the front-view image), a 3D Bezier patch with a
large area may be projected onto a Bezier patch with a very small area
in the face image. When this patch is rendered, a low-resolution
texture is observed in that region. For facial models, patches on the
cheeks are vulnerable to this problem. A solution is to acquire both
front-view and side-view texture images and blend them later in the
rendering stage.

3. Analysis 

The analysis stage, in the context of model-based image communica- 
tion, consists of extracting higher level information from the input video 
sequence. We discuss 2 main types of information, both pertaining to 
motion estimation: rigid head motion tracking and non-rigid facial mo- 
tion estimation. 

3.1 Review of Past and Current Work 

This section describes recent and current work in the field of analysis
for model-based video and image communication. We first list some
excellent reviews in the field of Model-based Video Coding (MBVC) and
then examine in detail one of the major research areas: head and facial
feature tracking.

3.1.1 MBVC systems 

The majority of research in MBVC has been driven by two different 
goals: 

■ Realistic reproduction of the original input sequence, using the model- 
based approach to drive the bit rate below those achievable with 
conventional coders. 

■ Synthetic reproduction of the original input sequence using analysis 
of the 2D scene to extract higher level parameters to drive an artificial 
head model at the decoder. 

In general, most approaches have been geared towards one of the above
goals; however, some systems have been proposed to handle both cases
effectively [93]. Some excellent reviews of MBVC and its recent progress
can be found in [94], [95], [96]. We next discuss in more detail the
recent work done in the areas related to head and feature tracking.

3.1.2 Rigid Motion 

In this section we review recent work in 3D head tracking which in- 
volves the recovery of the 3D rigid motion of a head in a 2D video 
sequence. Most of the recent head tracking (or more generally, 3D mo- 
tion estimation) research has taken one of two approaches: motion from 
feature tracking and motion from optic flow or primitive vectors. We 
consider both approaches in the following review. 

In [92] the authors use optical flow equations to estimate both the
global head motion and local facial expressions. Similarly, Li and
Forchheimer estimate rotation parameters by enforcing the optical flow
equations [26], [97]. In later work, depth information is obtained by
using the CANDIDE face model, and small motion from frame to frame is
assumed. Nakaya and Harashima use the distribution of 2D motion vectors
on the head (computed on a block basis) to compute 3D rotation and
translation [93]. A least squares method is used, and depth values are
taken from the wireframe head model. A straightforward template
matching technique is used by Kokuer and Clark to track features in 2D.
A cylindrical head shape is assumed in order to compute the
corresponding 3D motion for each axis independently [98]. Basu, Essa,
and Pentland use a 3D ellipsoidal model of the head to interpret optical
flow in terms of possible rotations of the head [27], [28]. This method
seems to work well; however, the computation of optical flow is
intensive and no 2D tracking of facial features is performed. They test
their method with real and synthetic sequences. Bozdagi, Tekalp and
Onural present an algorithm that uses an optical flow based framework
to estimate the 3D global motion, the local motion, and the adaptation
of the wireframe model simultaneously [32], [99]. The entire wireframe
is flexible and can vary from frame to frame to minimize the error in
the optical flow equations. To overcome the disadvantages of
traditional optical flow based methods, they also include photometric
effects. The approach seems to have a good theoretical basis but suffers
from a high computational load at each frame.

Horprasert, Yacoob and Davis employ a parameterized tracking method to
track 5 feature points on the face. Invariant cross-ratios derived from
face symmetry, together with statistical modelling of face structure,
are used to compute the 3 rotation angles [100]. In this approach, the
estimates for head orientations close to frontal are very sensitive to
localization errors in the tracking. In the research by Fukuhara and
Murakami, a set of five facial features is tracked in 2D, giving motion
vectors from frame to frame. These vectors are input to a three-layer
neural net to determine the 3D motion. The neural net is trained using
many possible motion patterns of the head with an existing 3D model
[101]. There are several drawbacks, including non-automatic initial
feature selection and the use of simple template matching to track the
features. Also, the recovered 3D motion is restricted to one of a
discrete number of possible motions determined by the training.

3.1.3 Non-Rigid Motion 

Another major research topic is the estimation of the non-rigid facial
motion in a video sequence. This motion is the result of the many
facial expressions humans make to communicate; therefore, research in
the area of expression detection and recognition is relevant. The
majority of work has concentrated on computing facial motion from
frontal or near-frontal pose images. More recently, the incorporation
of varying head pose has been examined.

Matsuno et al. use a deformable two-dimensional net, which they call a
Potential Net, to detect expressions in input images. A training
procedure is used to build a model of the net deformations for different
facial expressions [55]. In the work by Yacoob and Davis, an optical
flow approach is used to analyze and represent facial dynamics from
sequences. A mid-level symbolic representation is used to detect 6
expressions as well as eye-blinking [53]. Black and Yacoob use local
parametric models to track both rigid and non-rigid motion. Six basic
expressions are detected and some attempt is made to account for global,
rigid head motions [56]. Essa and Pentland propose a new method of
representing expressions by building a database of facial expressions
characterized by muscle activations. They then use this database to
recognize expressions with an underlying physics-based model as well as
by matching spatio-temporal motion-energy templates [51].



3.2 Tracking Rigid Global Motion 

The overall pose computation module we propose is shown in Figure 
6.8. This module, one of the most crucial steps in the tracking algorithm, 
provides higher-level information on the 3D movement of the head for 
use in synthesis, motion prediction, dynamic feature set changes, etc. As 
we see in the figure, there are three main steps. First, an initial estimate 
of the scale and transformation matrix is obtained by using the 2D-3D 
feature point correspondences. Next, an optimization stage computes 
the best true rotation matrix using a gradient descent algorithm. Finally, 
the resulting angles and scale factor are optimally filtered to smooth the 
data and to predict the motion in subsequent frames.

Figure 6.8. The 3D Motion Estimation/Filter/Prediction Module



3.2.1 2D-3D Pose Estimation 

The traditional pose estimation problem has a long history and many
major issues have been dealt with. We are faced with 2 main limitations
that restrict the approaches we can use:

■ Only a small number of facial features are both salient and rigid.

■ The tracked points contain a non-trivial amount of localization
error.

These two factors discourage the use of more sophisticated algorithms
that require large numbers of feature pairs or are very sensitive to
noise. Also, we make no assumptions about camera calibration or the
availability of any camera parameters. The general pose estimation
problem, in our scenario, is assumed to be the following. The imaging
model for true perspective projection converts model points to image
points as follows (coordinates are in the camera coordinate system, and
$f$ denotes the focal length):

$$\begin{Bmatrix} x_i \\ y_i \end{Bmatrix} = \frac{f}{Z_i} \begin{Bmatrix} X_i \\ Y_i \end{Bmatrix} \tag{6.1}$$

Since it is difficult to analyze systems with this model, a common
approximation is a scaled orthographic projection system. In this
model, object points are all assumed to have the same depth, $Z$:

$$\begin{Bmatrix} x_i \\ y_i \end{Bmatrix} = \frac{f}{Z} \begin{Bmatrix} X_i \\ Y_i \end{Bmatrix} = s \begin{Bmatrix} X_i \\ Y_i \end{Bmatrix} \tag{6.2}$$




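However, we do not have $\{X_i, Y_i, Z_i\}^T$ (the model points in the
camera coordinate system). What we have is $\{U_i, V_i, W_i\}^T$ (the
model points in the model coordinate system). Expressing the
transformation from the model coordinate system to the camera
coordinate system as a rotation $R_m$ followed by a translation, and
using the first point as a reference so that the translation cancels,
we have:

$$\begin{Bmatrix} X_i - X_0 \\ Y_i - Y_0 \\ Z_i - Z_0 \end{Bmatrix} = R_m \begin{Bmatrix} U_i - U_0 \\ V_i - V_0 \\ W_i - W_0 \end{Bmatrix} \tag{6.3}$$

where $R_m = [\mathbf{i}, \mathbf{j}, \mathbf{k}]^T$. Substituting
Eqn. 6.3 into Eqn. 6.2, and using the corresponding image point as
reference, we obtain:

$$\begin{Bmatrix} x_i - x_0 \\ y_i - y_0 \end{Bmatrix} = s \begin{bmatrix} \mathbf{i}^T \\ \mathbf{j}^T \end{bmatrix} \begin{Bmatrix} U_i - U_0 \\ V_i - V_0 \\ W_i - W_0 \end{Bmatrix} \tag{6.4}$$

Assuming we have at least 4 unique, non-coplanar points, $i = 0$ to $3$,
we can create the following linear system, in which the columns of
Image and Model are the difference vectors for $i = 1, 2, 3$:

$$\mathrm{Image} = s \begin{bmatrix} \mathbf{i}^T \\ \mathbf{j}^T \end{bmatrix} \cdot \mathrm{Model}$$

After a simple matrix inversion (the inverse exists because the points
are non-coplanar):

$$s \begin{bmatrix} \mathbf{i}^T \\ \mathbf{j}^T \end{bmatrix} = \mathrm{Image} \cdot (\mathrm{Model})^{-1} \tag{6.5}$$

We can then use the properties of orthonormal matrices to compute the
desired parameters, $s$, $\mathbf{i}$, $\mathbf{j}$, $\mathbf{k}$: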



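$$\| s\,\mathbf{i} \| = \| s\,\mathbf{j} \|$$

$$s_{avg} = \frac{\| s\,\mathbf{i} \| + \| s\,\mathbf{j} \|}{2}$$

$$\mathbf{k} = \mathbf{i} \times \mathbf{j}$$

Ideally the two rows of Eqn. 6.5 have the same norm, $s$; with noisy
data the two norms are averaged, and the unit vectors $\mathbf{i}$ and
$\mathbf{j}$ are the normalized rows.

So, our final result is an estimate of the pose and scale of the object
from the set of 4 feature point matches:

$$\hat{s} \cdot \hat{R}_m = s_{avg} \, [\mathbf{i}, \mathbf{j}, \mathbf{k}]^T \tag{6.6}$$
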

Since we are assuming a scaled orthographic projection system, we
cannot recover the translation along the Z axis. However, the 2D
translation, $T$, can be calculated as:

(6.7)

The “^” symbol indicates that these are estimates of the true pose and
scale, owing to the measurement noise in the image points. Ideally, this
estimate would represent a true orthonormal rotation matrix. However,
in calculating equation 6.5, we have forced the transformation to map
model points to noisy image points, and the result is not a true
rotation.
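
The estimation procedure of this section can be sketched as follows,
assuming NumPy and exactly 4 non-coplanar point matches with point 0 as
the reference. The function name estimate_pose and the test data are
hypothetical. The example uses noise-free points, so the recovered
matrix is an exact rotation, which is not the case with real tracked
features.

```python
import numpy as np

def estimate_pose(image_pts, model_pts):
    """Scaled-orthographic pose from 4 point matches (point 0 is the reference).

    image_pts: (4, 2) array of image points (x_i, y_i)
    model_pts: (4, 3) array of model points (U_i, V_i, W_i)
    Returns (s_avg, R_hat), where the rows of R_hat are i, j and k.
    """
    image_pts = np.asarray(image_pts, dtype=float)
    model_pts = np.asarray(model_pts, dtype=float)
    image = (image_pts[1:] - image_pts[0]).T     # 2x3, columns are differences
    model = (model_pts[1:] - model_pts[0]).T     # 3x3, invertible if non-coplanar
    s_rows = image @ np.linalg.inv(model)        # Eqn. 6.5: rows are s*i and s*j
    s_i, s_j = np.linalg.norm(s_rows, axis=1)
    s_avg = 0.5 * (s_i + s_j)
    i = s_rows[0] / s_i
    j = s_rows[1] / s_j
    k = np.cross(i, j)
    return s_avg, np.vstack([i, j, k])

# Noise-free synthetic check with a known pose (hypothetical values).
ty, tx = 0.3, 0.2
rot_y = np.array([[np.cos(ty), 0, np.sin(ty)], [0, 1, 0], [-np.sin(ty), 0, np.cos(ty)]])
rot_x = np.array([[1, 0, 0], [0, np.cos(tx), -np.sin(tx)], [0, np.sin(tx), np.cos(tx)]])
true_rotation = rot_y @ rot_x
true_scale = 2.0
model_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
image_pts = true_scale * (model_pts @ true_rotation.T)[:, :2] + np.array([5.0, -3.0])

scale, rotation = estimate_pose(image_pts, model_pts)
```

With noisy image points, i and j are generally not orthogonal, so the
returned matrix is only approximately a rotation.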

3.2.2 Pose Optimization

The recovered pose from the previous section is only an estimate of the
true pose (for the assumed projection model). Applying this
non-orthonormal transform to the 3D head model results in non-rigid
deformations, which are not desired. Also, since we would like to apply
filtering techniques to the recovered pose, it is necessary to convert
the estimated transform to parameters that make sense to filter, such
as rotation angles about each axis. Our approach is to:

■ Express the pose as 3 consecutive rotations, one about each axis.

■ Apply optimization techniques to recover the optimal angles and scale
for the estimated pose.

■ Filter these angles to obtain optimal estimates for the current pose
and to predict future poses.

The desired rotation matrix can be represented in an infinite number of
ways. We choose to represent it as rotations of $(\theta_y, \theta_x,
\theta_z)$ about the respective axes:

$$s \cdot R = s \cdot R_{\theta_y} R_{\theta_x} R_{\theta_z} \tag{6.8}$$

We can solve for the desired angles and scale by defining the following
error measure:

$$\varepsilon = f(\theta_x, \theta_y, \theta_z, s) = \left\| \hat{s} \cdot \hat{R}_m - s \cdot R_{\theta_y} R_{\theta_x} R_{\theta_z} \right\|^2$$

This minimization can be solved by using a version of Powell’s
quadratically convergent method. The result, then, is a representation
of the pose as 3 angles and a scale factor that minimize the error with
respect to the computed transform.
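
A minimal sketch of this optimization follows. It uses a simple
derivative-free compass search as a stand-in for Powell's method, which
is not reproduced here, and takes the sum of squared matrix entries as
the error measure. All function names are hypothetical, and the target
matrix is synthetic.

```python
from math import cos, sin

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def rot_x(t):
    return [[1, 0, 0], [0, cos(t), -sin(t)], [0, sin(t), cos(t)]]

def rot_y(t):
    return [[cos(t), 0, sin(t)], [0, 1, 0], [-sin(t), 0, cos(t)]]

def rot_z(t):
    return [[cos(t), -sin(t), 0], [sin(t), cos(t), 0], [0, 0, 1]]

def scaled_rotation(params):
    """s * R_y R_x R_z for params = (theta_x, theta_y, theta_z, s)."""
    tx, ty, tz, s = params
    r = matmul(rot_y(ty), matmul(rot_x(tx), rot_z(tz)))
    return [[s * v for v in row] for row in r]

def pose_error(params, target):
    m = scaled_rotation(params)
    return sum((target[r][c] - m[r][c]) ** 2 for r in range(3) for c in range(3))

def fit_pose(target, start=(0.0, 0.0, 0.0, 1.0), step=0.5, tol=1e-9):
    """Compass search: try +/- step on each parameter, halve step when stuck."""
    params = list(start)
    best = pose_error(params, target)
    while step > tol:
        improved = False
        for idx in range(4):
            for delta in (step, -step):
                trial = list(params)
                trial[idx] += delta
                err = pose_error(trial, target)
                if err < best:
                    params, best, improved = trial, err, True
        if not improved:
            step *= 0.5
    return params, best

true_params = (0.2, -0.3, 0.1, 1.8)       # (theta_x, theta_y, theta_z, s)
estimate = scaled_rotation(true_params)   # stands in for the estimate of Eqn. 6.6
fitted, residual = fit_pose(estimate)
```

In practice the target is the non-orthonormal estimate of Eqn. 6.6, in
which case the residual does not vanish and the fitted angles describe
the closest pose of the chosen form.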

3.2.3 Filtering and Prediction 

Localization errors in the feature tracking module propagate to the
recovered angles and scale computed above. When used for synthesis,
applying these pose computations to a head model results in jerky head
movements, which are visually unacceptable. One way to overcome this is
to use optimal filtering techniques to process the measurements of
angles and scale for each frame. To do this we implement one of the
best-known optimal filters, the discrete Kalman filter. The Kalman
filter is a recursive procedure that consists of two stages: time
updates (or prediction) and measurement updates (or correction). At each
iteration the filter provides an optimal estimate of the current state
using the current input measurement, and produces an estimate of the
future state using the underlying state model.

The values that we want to smooth and predict are the 3 angles and the
scale that determine the 3D pose: $(s, \theta_x, \theta_y, \theta_z)$.
We can filter these independently of each other, using a discrete-time
Newtonian physical model of rigid-body motion. In general, the linear
difference equations for each process can be written as:

$$x_{k+1} = A_k x_k + B u_k + w_k \tag{6.9}$$

$$z_k = H_k x_k + v_k \tag{6.10}$$

where $x_k$ is the state, $z_k$ is the measurement, $A_k$ is the state
transition matrix, and $H_k$ is the measurement matrix. We can ignore
the term $B u_k$ since we assume no external driving forces. The
variables $w$ and $v$ represent the process and measurement noise,
respectively, and are independent and white, with normal probability
distributions:

$$p(w) \sim N(0, Q), \qquad p(v) \sim N(0, R)$$



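With that said, the filtering algorithm is, for each incoming process
measurement $z_k$:

$$K_k = P_k^- H_k^T \left( H_k P_k^- H_k^T + R_k \right)^{-1}$$

$$\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H_k \hat{x}_k^- \right)$$

$$P_k = (I - K_k H_k) P_k^-$$

$$\hat{x}_{k+1}^- = A_k \hat{x}_k + B u_k$$

$$P_{k+1}^- = A_k P_k A_k^T + Q_k$$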



The filtered estimate can be used to synthesize the 3D motion of the 
head model, while the prediction can be used to produce templates for 
tracking the facial features in the next frame. 
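
The recursion above can be sketched for a single pose parameter as
follows. The sketch assumes a constant-velocity (Newtonian) state made
of an angle and its rate, a scalar angle measurement, and hypothetical
noise settings q and r; none of these values come from our system.

```python
def kalman_step(x, p, z, dt=1.0, q=1e-3, r=0.05):
    """One correction followed by one prediction for a constant-velocity model.

    x: predicted state (angle, rate); p: predicted 2x2 covariance (nested lists);
    z: measured angle. Assumed model: A = [[1, dt], [0, 1]], H = [1, 0],
    Q = q * I, R = r. Returns (filtered_state, next_prediction, next_covariance).
    """
    # Measurement update: K = P- H^T (H P- H^T + R)^-1
    s = p[0][0] + r
    k0, k1 = p[0][0] / s, p[1][0] / s
    innovation = z - x[0]
    xf = (x[0] + k0 * innovation, x[1] + k1 * innovation)
    # P = (I - K H) P-
    pf = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
          [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
    # Time update: x-_{k+1} = A x_k and P-_{k+1} = A P A^T + Q
    xp = (xf[0] + dt * xf[1], xf[1])
    a00 = pf[0][0] + dt * (pf[1][0] + pf[0][1]) + dt * dt * pf[1][1]
    a01 = pf[0][1] + dt * pf[1][1]
    a10 = pf[1][0] + dt * pf[1][1]
    pp = [[a00 + q, a01], [a10, pf[1][1] + q]]
    return xf, xp, pp

# Hypothetical head rotation at 0.02 rad/frame, measured with +/-0.1 rad jitter.
true_angles = [0.02 * k for k in range(60)]
measurements = [a + (0.1 if k % 2 == 0 else -0.1) for k, a in enumerate(true_angles)]

state, cov = (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]
filtered = []
for z in measurements:
    estimate, state, cov = kalman_step(state, cov, z, q=1e-5, r=0.01)
    filtered.append(estimate[0])
```

The filtered sequence is what would drive the head model, while the
predicted state returned by each step is what would be used to render
templates for the next frame.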



3.3 Tracking Non-Rigid Local Motion 

In contrast to the estimation of the global head pose, the estimation 
of the non-rigid motion from a single view is ill-posed. That is, given the 
two dimensional displacement of a marking point in the face surface, the 
three dimensional motion of the corresponding point in the head model 
cannot be computed uniquely. 

One possibility is to consider stereo vision, in which case the actual
three-dimensional local motion of all the points that can be accurately
matched from one view to the other can be easily obtained from the
geometrical constraints provided by the camera calibration parameters.
This approach, however, might not be suitable for scenarios such as
low-rate video coding, where most likely only a single camera is used
or accurate camera calibration cannot be achieved.

Another possibility is to impose some constraints to reduce the degrees
of freedom of the non-rigid local motion of the facial features so that
it can be estimated from a single view. Finally, a model-based approach
for non-rigid motion estimation of facial features can be used. Such an
approach would require a head/face model in which the non-rigid motion
of the facial features is parameterized.

3.3.1 2D Motion + Constraints 

One of the methods available for computing the 3D motion vectors
associated with non-rigid facial motion is the application of
constraints to the 2D motion of the image points. As mentioned earlier,
there are two types of motion the points on the image plane can undergo:
rigid and non-rigid motion. The location of the rigid points allows us
to compute the 3D rigid motion and then align the 3D head model to the
2D image. The non-rigid features will undergo the estimated rigid motion
plus some non-rigid motion specific to that region of the face. Each
image frame is composed of points from each of these sets, so for the
image at time $t_i$ we have:

$$Image^i = \{ I^i_{rigid},\ I^i_{non\text{-}rigid} \}$$

$$Image^i_{rigid} = \left[ T^i \circ (Model)_{rigid} \right]_{2d}$$

$$Image^i_{non\text{-}rigid} = \left[ T^i \circ (Model)_{non\text{-}rigid} \right]_{2d} + \Delta V_{2d}$$

The first of these equations states that the image at time $t_i$ is
composed of an image resulting from rigid motion and an image resulting
from non-rigid motion. In the second equation, the image resulting from
rigid motion is shown to be the result of transforming the 3D model and
projecting it into the 2D plane. The third equation states that the
image resulting from non-rigid motion is composed of the rigid
transformation, followed by projection to 2D, and then some 2D
deformation. We can then use the location of the non-rigid features to
back-project onto a 3D surface and retrieve the new model coordinates
for that particular image feature. One way to create this surface is to
use a Cyberware head scan of an individual, with appropriate smoothing
to reduce surface discontinuities (Fig. 6.9). The underlying assumption
we make here, of course, is that points on the face travel along its
surface in 3D. This is certainly a valid assumption for facial features
such as the eyebrow corners. However, even for points around the mouth
and eyes, this assumption is valid enough to be used in recovering good
estimates of the 3D motion vectors. The estimates of the new 3D points
can be used to create a modified 3D model:

$$Model' = Model + \left[ Image^i_{non\text{-}rigid} \right]_{3d}$$

Finally, using the new 3D model, plus a reference pose (for example,
the pose of the head in the first frontal image frame), we can create a
new image. The result is a pair of images that use the same frame of
reference and can be used to compute the Facial Animation Parameters
that describe the rigid and non-rigid motion in the sequence. After
computing the desired facial motion, we can apply it to the synthetic 3D
model along with the computed rigid head pose. This procedure is shown
later in the Results section in Fig. 7.47.

$$Image^{i\prime} = \left[ T^0 \circ (Model') \right]_{2d}$$

$$FAP = f(I^0, I^{i\prime})$$

Figure 6.9. (a) Original Cyberware head scan, (b) Smooth surface
approximation used to recover 3D coordinates.

At this point we should mention that the constraint that points move
along the surface of the model is by no means the only one we can
impose. We can also recover 3D motion vectors for several facial
features by enforcing the constraint that the motion is perpendicular
to the normal vector of the face surface.

First, we need to obtain the normal vectors of the face surface given
by the Cyberscan range data. Let $S$ be a surface given by the range
data as

$$S = \{ X(i,j) \in \mathbb{R}^3;\ i = 0, 1, \ldots, N,\ j = 0, 1, \ldots, M \}$$

We compute the normal vector $n(X_0) = [-n_x\ \ {-n_y}\ \ 1]'$ by
fitting the plane

$$(X - X_0) \cdot n(X_0) = 0$$

to the surface $S$ using least squares over a region around $X_0$; that
is, we find $(n_x, n_y)$ that minimizes the error

$$\sum_{(i,j) \in W_{X_0}} \left( z(i,j) - z_0 - n_x (x(i,j) - x_0) - n_y (y(i,j) - y_0) \right)^2$$

where $X(i,j) = [x(i,j)\ \ y(i,j)\ \ z(i,j)]'$ and
$X_0 = [x_0\ \ y_0\ \ z_0]'$ are surface points, and $W_{X_0}$ is a
region around $X_0$.
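
The least-squares fit above reduces to a 2x2 system of normal equations.
A minimal sketch, assuming the window around X_0 is given as a list of
(x, y, z) samples (the function name fit_normal and the synthetic planar
window are hypothetical):

```python
def fit_normal(points, center):
    """Least-squares plane fit z - z0 = n_x (x - x0) + n_y (y - y0).

    Returns the (unnormalized) normal vector [-n_x, -n_y, 1].
    """
    x0, y0, z0 = center
    sxx = sxy = syy = sxz = syz = 0.0
    for x, y, z in points:
        dx, dy, dz = x - x0, y - y0, z - z0
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        sxz += dx * dz
        syz += dy * dz
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window is degenerate")
    # Solve [[sxx, sxy], [sxy, syy]] [n_x, n_y]^T = [sxz, syz]^T by Cramer's rule.
    n_x = (sxz * syy - syz * sxy) / det
    n_y = (syz * sxx - sxz * sxy) / det
    return (-n_x, -n_y, 1.0)

# Synthetic 5x5 window sampled from the plane z = 0.5 x - 0.25 y + 2.
center = (0.0, 0.0, 2.0)
window = [(float(i), float(j), 0.5 * i - 0.25 * j + 2.0)
          for i in range(-2, 3) for j in range(-2, 3)]
normal = fit_normal(window, center)
```

For range data the window would be taken from the smoothed surface
around the feature of interest.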

Let $X_0 \in \mathbb{R}^3$ be the position of a facial feature on the
face surface of the head model, and let $T$ be a 4x4 homogeneous
transformation matrix representing the estimated head pose at a given
frame, such that

$$X_0' = T \cdot X_0 = R \cdot X_0 + t$$

where $X_0' \in \mathbb{R}^3$ is the position of the feature in the
image space, $R$ is a rotation matrix, and $t$ is a translation vector.
Note that although the z-component of $X_0'$ is not used to render a
synthetic image with orthographic projection, it is well defined. On
the other hand, let $X_1'$ be the position of the facial feature in the
image frame of the video sequence, obtained by template matching.

First, we rotate the normal vector of the surface,
$n' = R \cdot n = [n_x'\ \ n_y'\ \ n_z']'$, so that it can be used in
the image space. Then, we force the motion to lie perpendicular to this
normal vector; that is, we compute $z_1'$ from:

$$z_1' = z_0' - \frac{n_x'}{n_z'} (x_1' - x_0') - \frac{n_y'}{n_z'} (y_1' - y_0')$$

Finally, we compute the 3D motion vector $X_1 - X_0$ in the model space
using the inverse rotation matrix:

$$(X_1 - X_0) = R^{-1} \cdot (X_1' - X_0')$$

Shown in Fig. 6.10 is an example of this approach.
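
The three steps above (rotate the normal, solve for the missing depth,
rotate back) can be sketched as follows. The function names and the
numerical values are hypothetical, and the rotation is assumed to be
orthonormal so that its inverse is its transpose.

```python
from math import cos, sin

def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))

def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]

def motion_vector_3d(x0_img, xy1_img, normal_model, rotation):
    """Recover X_1 - X_0 in model space from a tracked 2D feature position.

    x0_img: X_0' = (x0', y0', z0'), the posed model feature
    xy1_img: (x1', y1'), the tracked image position
    normal_model: surface normal n at X_0 in model space
    rotation: 3x3 rotation R of the estimated pose
    """
    nx, ny, nz = mat_vec(rotation, normal_model)        # n' = R n
    if abs(nz) < 1e-9:
        raise ValueError("surface is edge-on; depth is not recoverable")
    dx = xy1_img[0] - x0_img[0]
    dy = xy1_img[1] - x0_img[1]
    dz = -(nx * dx + ny * dy) / nz                      # n' . (X1' - X0') = 0
    return mat_vec(transpose(rotation), (dx, dy, dz))   # R^-1 = R^T

# Synthetic check: a motion that lies in the tangent plane is recovered.
angle = 0.4
rotation = [[cos(angle), 0.0, sin(angle)],
            [0.0, 1.0, 0.0],
            [-sin(angle), 0.0, cos(angle)]]
normal = (-0.2, 0.1, 1.0)
true_motion = (0.1, 0.05, 0.015)                 # perpendicular to the normal
x0_img = mat_vec(rotation, (1.0, 2.0, 3.0))      # posed feature (translation omitted)
moved = mat_vec(rotation, true_motion)
xy1_img = (x0_img[0] + moved[0], x0_img[1] + moved[1])
recovered = motion_vector_3d(x0_img, xy1_img, normal, rotation)
```

The guard on n_z' reflects that the depth cannot be recovered where the
surface is seen edge-on.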



3.3.2 Model-Based Non-Rigid Local Motion Estimation 

Local motion estimation 

Another approach to recovering non-rigid local motion is to use a
model-based approach with parametric models. Let us assume we have a
face/head model in which each vertex position is obtained from:

$$x_i' = x_i + v_i(\alpha), \quad i = 1, \ldots, N \tag{6.11}$$

where $x_i$, $i = 1, \ldots, N$, provides the shape of the individual
being modelled, $v_i(\alpha)$ is a parameterized description of the
motion of each vertex, and the vector $\alpha$ represents the non-rigid
motion state.

Figure 6.10. Results of 3D motion vector computation. (a) Location of
2D features in the video sequence. (b) The computed 3D motion vectors
at the estimated pose. (c) The 3D motion vectors seen from another
viewpoint.

Note that for rigid points such as the eye corners, nose, etc., the
motion component is zero. Therefore, independently of the facial
expression, the head pose $T$ can be estimated from the rigid points of
the face.

All the facial features not occluded at a pose $T$ are projected to the
image plane at the position:

$$X_i'(\alpha) = T \cdot (x_i + v_i(\alpha)), \quad i \in S \tag{6.12}$$

where $S$ is the list of visible points at the given view.

The problem of estimating the non-rigid motion can be solved by finding
the $\alpha$ that minimizes the overall error:

$$\sum_{i \in S} \left\| X_i'(\alpha) - X_i \right\|^2 \tag{6.13}$$

where $X_i$ is the position of vertex $i$ obtained from the input video
frame using a visual matching technique.

With this technique, the quality of the analysis depends on the
accuracy with which the model describes the facial expressions. Simple
models might not provide good results. However, the optimization
algorithm might show convergence problems if the models are too complex
or have too many degrees of freedom.
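
For the simplest possible parametric model the minimization of Eqn. 6.13
has a closed form. The sketch below assumes a single scalar parameter
and a linear motion model, v_i(alpha) = alpha * d_i, with a known rigid
pose and an orthographic view, so that only the image-plane components
enter the error. The function names, the displacement directions d_i and
the vertex data are hypothetical.

```python
def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))

def estimate_alpha(shape, directions, observed, rotation, translation):
    """Closed-form minimizer of Eqn. 6.13 for the model v_i(alpha) = alpha * d_i.

    shape: neutral vertices x_i; directions: displacement directions d_i;
    observed: tracked 2D positions X_i; rotation, translation: rigid pose T.
    """
    num = den = 0.0
    for x, d, obs in zip(shape, directions, observed):
        posed = mat_vec(rotation, x)
        moved = mat_vec(rotation, d)
        for c in range(2):                              # image-plane components only
            a = posed[c] + translation[c] - obs[c]      # residual at alpha = 0
            b = moved[c]                                # sensitivity to alpha
            num += a * b
            den += b * b
    if den == 0.0:
        raise ValueError("alpha is not observable from these points")
    return -num / den

# Synthetic check with hypothetical jaw vertices and a known opening.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
translation = (1.0, 2.0, 0.0)
shape = [(0.0, -1.0, 0.5), (0.3, -1.1, 0.4), (-0.3, -1.1, 0.4)]
directions = [(0.0, -1.0, 0.0), (0.1, -0.9, 0.0), (-0.1, -0.9, 0.0)]
true_alpha = 0.35
observed = [(x[0] + true_alpha * d[0] + translation[0],
             x[1] + true_alpha * d[1] + translation[1])
            for x, d in zip(shape, directions)]
alpha = estimate_alpha(shape, directions, observed, identity, translation)
```

With several parameters or a non-linear v_i(alpha), an iterative
optimizer is needed, which is where the convergence problems mentioned
above can appear.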

4. Model-based Video System Implementation 

Although progress in MBVC has been steady, it is still a very immature
field compared to traditional coding techniques. One of the most
notable gaps has been the lack of complete encoder/decoder systems with
which to test the validity of the research done so far. We aim to
provide such a test-bed (Fig. 6.11) and to demonstrate its application
in a realistic scenario. The coding system we assume is shown in Figure
6.12. It takes a computer graphics approach and analyzes the video
sequences to extract higher-level knowledge regarding motion,
expressions, etc. These parameters are then sent along the channel to
drive a head model on the receiving end. Although this method
potentially requires lower bandwidth, the end result is a video
sequence whose quality depends on the quality of the underlying model,
how well it is fitted to the subject, the underlying parametrization
(muscle, tissue, etc.), the rendering hardware available, and other
factors. We now give a brief description of our basic approach. The
research discussed in previous sections is used to implement the
various stages of the system.

Figure 6.11. Model-based Coding Interface



4.1 Tracking Non-Rigid Local Motion 

In contrast to the estimation of the global head pose, the estimation
of the non-rigid motion from a single view is ill-posed. That is, given
the two dimensional displacement of a marking point in the face surface,
the three dimensional motion of the corresponding point in the head
model cannot be computed uniquely.

One possibility is to consider stereo vision, in which case the actual
three-dimensional local motion of all the points that can be accurately
matched from one view to the other can be easily obtained from the
geometrical constraints provided by the camera calibration parameters.
This approach, however, might not be suitable for scenarios such as
low-rate video coding, where most likely only a single camera is used
or accurate camera calibration cannot be achieved.

Another possibility is to impose some constraints to reduce the degrees
of freedom of the non-rigid local motion of the facial features so that
it can be estimated from a single view. Finally, a model-based approach
for non-rigid motion estimation of facial features can be used. Such an
approach would require a head/face model in which the non-rigid motion
of the facial features is parameterized.

Figure 6.12. Model-based Video Coding System



4.2 Tracking System 

In Figure 6.13 we can see the main tracking system for 3D and 2D 
motion. It consists of two main steps: the initialization and the main 
tracking loop. 
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Figure 6.13. Object Tracking System 

4.2.1 Initialization 

Initialization of the facial feature locations is done automatically using 
the first image frame and the texture map obtained from the Cyberware 
scanner, Fig. 6.14. After the initial feature points are obtained, the
3D pose is computed and applied to the head model [102]. This aligns 
the model with the first video frame so that the initial texture mapping 
can be performed. This texture map, along with any texture updates, 
is then used to create templates for the subsequent frames. 




Figure 6.14. Initialization for Tracking System



4.2.2 Tracking Loop 

The main tracking loop consists of 3 steps: 

1 Compute pose from 2D-3D point pairs. 

2 Render templates using pose and 3D head model. 

3 Locate features in current frame. 

Note that at this stage the system has 3 major outputs which
can be of use in an MBVC system: 3D motion estimates, 2D feature
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tracking, and synthesized approximations to the original input sequence. 
Also, using the methods discussed earlier, we can convert the 2D feature 
tracking into the appropriate 3D motion vectors to drive the facial ex- 
pressions of a synthetic model. The current system will implement basic 
eyebrow, eye, mouth, and jaw movements. 
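
Step 3 of the loop, locating a feature in the current frame, can be illustrated with a small template search. The sketch below is only an assumption about how such a search might look: the text does not state which similarity measure is used, so normalized cross-correlation is swapped in here, and the function name `locate_feature`, its arguments and the synthetic test frame are hypothetical.

```python
import numpy as np

def locate_feature(frame, template, predicted, radius):
    """Search around `predicted` (row, col of the template's top-left corner)
    and return the best-matching position and its correlation score."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, None
    r0, c0 = predicted
    # Clip the search window so every candidate patch lies inside the frame.
    row_lo, row_hi = max(0, r0 - radius), min(frame.shape[0] - th, r0 + radius)
    col_lo, col_hi = max(0, c0 - radius), min(frame.shape[1] - tw, c0 + radius)
    for r in range(row_lo, row_hi + 1):
        for c in range(col_lo, col_hi + 1):
            patch = frame[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            # Normalized cross-correlation; flat patches score zero.
            score = (t * p).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Synthetic check: cut a template out of a random frame and find it again.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
template = frame[30:38, 40:48].copy()
position, score = locate_feature(frame, template, predicted=(28, 43), radius=5)
```

In a tracker of this kind the template would be rendered from the textured head model at the current pose estimate, and `predicted` would come from projecting the 3D feature point into the image.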
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Figure 6.15. Main Tracking Loop 



4.2.3 Head Modelling 

Our proposed tracking system uses an underlying 3D model of the
object being tracked. We choose a straightforward method of obtaining
head models by using a Cyberware 3D range scanner. The 3D coordinates
of the features to be tracked are obtained directly from the range
data. One concern for real-time rendering is the high resolution of
the data. To remedy this, a sub-sampled version of the head scans was
used, along with texture mapping techniques, to create very accurate
synthesized images.

4.3 Practical Implementation Issues 

Currently, the system is implemented on an SGI Onyx Reality Engine.
Using the highly optimized rendering pipeline, the rigid head tracking
algorithm runs at more than 10 fps. Input to the system is a
monitor-mounted COHU grayscale camera with a video field size of 360x243
and an image frame size of 720x486. The most computationally expensive step
in the video analysis is the template matching stage. This has been im- 
plemented in parallel using multiple processors to increase performance 
and maintain frame rates when tracking large numbers of features. Video 
frames are processed as quickly as possible and no attempt is made to 
keep pace with the steady 30fps video input signal. For all graphics op- 
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erations, standard OpenGL libraries have been used so that the system 
is easily implemented on other platforms (Sun, HP, PC). The system 
has also been used in conjunction with a D1 digital tape machine and 
VLAN interface to analyze long video sequences without loss of frames. 
This is useful in conjunction with other projects such as video analysis
for gisting and face recognition using video sequences, since both
research areas can make substantial use of head tracking in general.
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IMPLEMENTATIONS, EXPERIMENTS 
AND RESULTS 



In this chapter, we describe in detail the experiments carried out to 
evaluate the performance of the reported techniques. In Section 1, we 
describe the image and video databases used in the evaluations. Sec- 
tion 2 addresses the problem of face detection in complex backgrounds; 
we compare our face detection system with the neural-network-based 
system reported in [1]. In Section 3, a more detailed analysis of the use
of information-based maximum discrimination classifiers is presented in 
the context of facial feature detection and tracking. Finally, in Section 4, 
we evaluate the algorithm for embedded face and facial expression recog- 
nition. 

1. Image and Video Databases 
1.1 FERET database 

The face recognition technology (FERET) program of the U.S. Army 
Research Laboratory has developed a standard test to evaluate and com- 
pare different techniques for face recognition [24, 25]. As part of this test, 
a database of several thousand facial images was collected. This
database consists of several pictures per person taken at different times 
and views, with different illumination conditions and facial expressions, 
wearing glasses, make-up, etc. Examples of these images are shown in 
Figures 7.1 and 7.2. 

We use a subset of this database solely for the purpose of extracting
information about faces. For this subset of 821 images of frontal view 
faces, we have labelled the position of the facial features by hand. We 
use these locations together with the images to train face and facial 
feature detectors, and to learn the distribution of the relative positions 
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Figure 7.1. Images from the FERET database 



of the facial features. Examples of these training procedures are given 
in Figures 3.2 (page 19) and 3.4 (page 21). 

1.2 CMU/MIT database for face detection 

An image database has been widely used to test neural-network-based 
face detection systems in complex backgrounds [21, 22]. The database
consists of three sets of grey-level images, two of which were collected at 
Carnegie Mellon University (CMU) and the other at the Massachusetts 
Institute of Technology (MIT). These images are scanned photographs, 
newspaper pictures, files collected from the World Wide Web, and dig- 
itized television shots. The first two sets, from CMU, contain 169 
faces in 42 images and 183 faces in 65 images, respectively. The set 
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Figure 7.2. Images from the FERET database 



from MIT consists of 23 images containing 155 faces. Figure 7.3 shows
some of the images from these sets. The images of this database,
together with the ground-truth locations of the faces, are available from
http://www.cs.cmu.edu/People/har/faces.html.

1.3 Face video database 

We have partially collected a database of video segments of head- 
and-shoulder scenes intended for the evaluation of algorithms for facial 
analysis. We recorded these videos with a hand-held camera aimed at 
the face of the person sitting in front of a computer. A stimulus sequence
was shown on the monitor of the computer to guide the user with
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Figure 7.3. Images from the CMU/MIT database 



instructions and to automatically generate the temporal labelling of the 
events shown in the videos. 

This database consists of 54 video segments: three distinct video
segments for each of 18 people. The first segment, intended mainly for
expression recognition, consists of three repetitions of seven facial expressions
in the following sequence: neutral, sadness, neutral, happiness, neutral,
surprise, neutral, disgust, neutral, anger, neutral. The neutral expression
was held for approximately 2 s to separate the others, which lasted
approximately 3 s each, for a grand total of 2600 frames.
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The second set of video segments consists of the repetitions of facial 
gestures while showing three facial expressions: neutral expression, hap- 
piness, and anger. The facial gestures include head nodding, shaking, 
eye winking, and others. The total length of these segments is 2020 
frames. The last set of videos consists of 1350 frames of more relaxed 
gestures, expressions, and head motion. It is intended for face tracking 
and recognition with extreme facial expressions. 

All these videos were digitized in synchrony with the stimulus se- 
quence by detecting a high contrast change pattern that was displayed 
as part of the stimulus sequence several seconds in advance and recorded 
at the beginning of each segment. This pattern was easily detected by 
thresholding the frame difference, and the extra frames at the beginning 
of the sequence were skipped. We compressed and stored the videos in 
MPEG-1 format, 320 x 240 color pixels, at 30 frames per second and a
rate of approximately 1 Mbit/s. Figures 7.17-7.21 (pages 65-69) show
several examples of frames of these video segments and the results of our 
facial feature detector and tracker. 
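
The synchronization step described above, detecting the high contrast change pattern by thresholding the frame difference, can be sketched as follows. This is a minimal illustration under stated assumptions: the mean absolute difference as the change measure, the threshold value, and the function name `find_sync_frame` are all hypothetical and not taken from the text.

```python
import numpy as np

def find_sync_frame(frames, threshold):
    """Return the index of the first frame whose mean absolute difference
    from the previous frame exceeds `threshold`, or None if there is none."""
    previous = np.asarray(frames[0], dtype=float)
    for index in range(1, len(frames)):
        current = np.asarray(frames[index], dtype=float)
        if np.abs(current - previous).mean() > threshold:
            return index
        previous = current
    return None

# Toy sequence: three dark frames followed by a bright pattern.
dark = np.zeros((4, 4))
bright = np.full((4, 4), 255.0)
start = find_sync_frame([dark, dark, dark, bright, bright], threshold=50.0)
```

Frames before the returned index would then be skipped, so that the remaining frames line up with the stimulus sequence.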

2. Face Detection in Complex Backgrounds 

We have tested a simple version of our face detection algorithm with
the CMU/MIT database, and compared our learning technique to that of
the neural-network-based face detection approach reported in [21, 22].

In order to compare the performance of the visual pattern recognition,
we trained the face-to-background classifier with only 11 x 11 pixels
so that faces as tiny as 8 x 8 pixels present in this database could
be detected. We used face examples from the FERET database and 
background examples from another collection of general images. Note 
that these results do not reflect those of the complete system, not only 
because of the low resolution of the classifier, but also because we turned 
off the postprocessing of the face candidates and the face validation with 
the facial feature detection. 

In our test, a face candidate obtained from the detection procedure 
was considered correct if it was in the correct scale and if the error 
in the position of the face (with respect to the ground truth) was less 
than 10% of the size of the face. All other candidates were marked as
false detections. Figure 7.4 shows the receiver operating characteristic
(ROC) of the system obtained with the CMU database; the vertical axis 
shows the correct detection rate, while the horizontal one shows the false 
detection rate. 


Table 7.1 shows the performance of our system together with that of the
neural-network-based system reported in [22] with two different network
architectures. The first has a total of 2905 connections while the second
has 4357.

Figure 7.4. Face detection ROC with CMU database

Table 7.1. Comparison between our face detection system and that reported in [1]

System description              Detected faces  Detection rate  False faces  False alarm rate
Our system (threshold 1)        -               -               12758        -
Our system (threshold 2)        -               93.9%           8122         1/3243
Our system (threshold 3)        -               89.9%           7150         1/3684
Our system (threshold 4)        -               86.8%           6133         1/4294
Neural Network 1 (2905 conn.)   470/507         -               1768         1/47002
Neural Network 2 (4357 conn.)   466/507         91.9%           1546         1/53751
Neural Network 3 (2905 conn.)   463/507         91.3%           2176         1/38189
Neural Network 4 (4357 conn.)   470/507         92.7%           2508         1/33134

Note that our system produced about three times more false alarms for 
about the same detection rate. On the other hand, although it is difficult 
to make a rigorous comparison of the computational requirements of 
these two face detection approaches, we roughly estimate our method to 
be at least two orders of magnitude faster. 

Face detection is achieved by testing a large number of subwindows
with face classifiers. Our face detection approach differs from the neural- 
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network-based approach not only in the speed of the subwindow test, 
but also in the number of subwindows tested. In order to estimate the 
difference in computational requirements of these two face detection ap- 
proaches, we first estimate the ratio between the numbers of operations 
required by these classifiers to test each subwindow. Then, we estimate 
the ratio between the total numbers of subwindows that these two clas- 
sifiers need to test to find faces with the same range of size since the 
scaling factors in the multiscale search are not equal. 

The two network architectures have 4357 and 2905 connections, re- 
spectively. Neglecting the requirements of the activation functions, and 
assuming that each connection requires one floating-point multiplica- 
tion, one floating-point addition, and one “float” (4 bytes) to keep the 
weight factor, these systems require 8714 and 5810 floating-point oper- 
ations, and about 17 and 11 Kbytes of memory, respectively. On the 
other hand, assuming that 80% of the 11x11 pixels are used in the 
likelihood distance, our technique requires about 100 fixed-point addi- 
tions and 1600 “shorts” (about 3 Kbytes) to hold the pre-computed log 
likelihood table of (2.17). Disregarding that floating-point operations 
require either more hardware or more CPU time than fixed-point oper- 
ations, and the effect of cache because the data used by our algorithm 
is five times smaller, we estimate our classifiers to test each subwindow 
between 58 and 87 times faster. 
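
The per-subwindow figures quoted above can be reproduced with a few lines of arithmetic. The sketch below only restates the counting assumptions of the text (two floating-point operations and one 4-byte weight per connection, about 100 additions for our classifier); the 2-byte size of a "short" and the helper name `network_cost` are assumptions made for illustration.

```python
def network_cost(connections, bytes_per_weight=4):
    """Operations and weight storage for one subwindow test, counting one
    multiplication and one addition per connection."""
    return 2 * connections, connections * bytes_per_weight

OUR_ADDITIONS = 100          # figure quoted in the text
pixels_used = 0.8 * 11 * 11  # 96.8 pixels, rounded to about 100 in the text
table_bytes = 1600 * 2       # 1600 shorts, assuming 2 bytes per short

speedups = {}
for connections in (4357, 2905):
    operations, weight_bytes = network_cost(connections)
    speedups[connections] = operations / OUR_ADDITIONS
    print(connections, operations, weight_bytes, round(speedups[connections]))
```

The two ratios come out at roughly 87 and 58, which is where the range "between 58 and 87 times faster" comes from.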

Suppose that we use these systems to search for faces between $S_1 = 20$
and $S_2 = 200$ pixels in size, and that the input image is large compared
to the subwindow size $S_w$; that is, $W, H \gg S_w$. In a multiscale search
approach with scale factor $\alpha$, the total number of windows tested can
be approximated by

$$N \approx WH \sum_{k=n_1}^{n_2} \alpha^{-2k} \qquad (7.1)$$

where $n_1$ and $n_2$ are computed from $n_k = \ln(S_k/S_w)/\ln\alpha$.

Considering that our system uses the scale factor $\alpha = \sqrt{2}$ and the
subwindow size $S_w = 11$, and that the neural-network-based implementation
uses $\alpha = 1.2$ and $S_w = 20$, we estimate the ratio between the
numbers of subwindows required to be tested to be 3.26.
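
The subwindow ratio can be checked numerically from the multiscale approximation, in which the number of windows at scale index k shrinks with the square of the scale factor raised to k. The sketch below is an illustration under one assumption that the text does not state: the scale indices are rounded down to integers. With that choice the computation gives a ratio of about 3.26; the image size W x H cancels in the ratio and is omitted. The function name `relative_window_count` is hypothetical.

```python
import math

def relative_window_count(alpha, s_w, s_min=20, s_max=200):
    """Sum of alpha**(-2k) over the scale indices needed to cover faces
    from s_min to s_max pixels with a subwindow of s_w pixels."""
    # Assumption: scale indices are rounded down to whole numbers.
    n1 = math.floor(math.log(s_min / s_w) / math.log(alpha))
    n2 = math.floor(math.log(s_max / s_w) / math.log(alpha))
    return sum(alpha ** (-2 * k) for k in range(n1, n2 + 1))

ours = relative_window_count(math.sqrt(2), 11)   # scale factor sqrt(2), 11-pixel window
neural = relative_window_count(1.2, 20)          # scale factor 1.2, 20-pixel window
ratio = neural / ours
print(round(ratio, 2))
```

Multiplying this ratio by the per-subwindow factors of 58 and 87 gives approximately 189 and 283.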

Overall, combining the ratio between the number of operations re- 
quired by these face detection approaches to test each subwindow with 
the classifiers (from 58 to 87) and the ratio between the total num- 
ber of subwindows tested in these approaches (approximately 3.26), we 
estimate our face detection system to be between 189 and 283 times 
faster. A more detailed comparison should also take into account the 
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preprocessing step, which adds additional computation to the neural- 
network-based system. 

2.1 Further comparison issues 

While we have compared the aforementioned detection systems with 
the same test database, there are a number of issues that prevent this 
comparison from being truly useful. 

1 The training sets used in these two approaches are different in both 
positive and negative examples. Most importantly, in our case,
the training set is very different from the test set; the face images
from the FERET database are noiseless and the faces are in near- 
perfect frontal view pose, while the CMU/MIT database includes a 
wide variety of noisy pictures, cartoons, and faces in different poses. 
Figures 7.1 and 7.3 show examples of these databases. 

2 The preprocessing algorithms used in these two approaches are also
very different. In [22], nonuniform light conditions are compensated
by linear fitting. The better the preprocessing algorithm, the bet- 
ter the performance of the learning technique. However, since such 
algorithms have to be applied to each of the tested subwindows be- 
fore they are fed to the classifier, complex preprocessing algorithms 
introduce an extremely large amount of computation, especially in a 
multiscale search scheme. In our implementation, aimed at real-time 
operation, we used a simple histogram equalization procedure as the 
preprocessing step and left the classifier with the task of dealing with 
the variations in illumination. 

3 The scale ratios in the multiscale detection schemes in these two tech- 
niques, and therefore the scale variations handled by the classifiers, 
are not the same. With less scale variation the classifiers are expected 
to perform better; however, a greater number of scaled images must 
be used to find faces with similar size, resulting in another increase 
of computation. 

4 The size of the subwindow used to feed the classifiers reflects the 
amount of information available to make the decision. Larger subwin- 
dows are expected to perform better. For this evaluation, however, 
we used a window of 11 x 11 pixels, mainly because of the presence
of faces as tiny as 8 x 8 pixels, while the system in [22] was reported 
to use a window of 20 x 20 pixels. 
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3. Facial Feature Detection and Tracking 

In this section, we first present a detailed study of the information- 
based maximum discrimination classifiers by analyzing their performance 
on the problem of facial feature detection. Then, we show examples of 
the facial feature detection and tracking on selected frames from the face 
video database. 

We have labelled by hand the location of the facial features in 243 
selected frames which show the most distinctive expressions found on the 
first set of videos of each person in the face video database. We trained 
the classifiers using one half of the labelled images at three different 
scales and rotations, and tested them on the other half at randomly 
selected scales and rotation angles. 

3.1 Facial feature detection 

In order to evaluate the performance of these classifiers, we measure
the inaccuracy of the detector and compute the receiver operating
characteristic (ROC) of the classifiers. The location inaccuracy of the
detector is measured as the average distance in pixels between the
hand-selected location of the features and the peak location of the
classifier response in the search areas. Note that this average error does
not measure the confidence level of the classification. The performance of
the facial feature detectors, on the other hand, is studied from the ROC of
the classifiers. The criterion for errors in the location of the features is
10% of the distance between the eyes.
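
The two quantities described above, the average location error and the 10% criterion, can be written down directly. The following is a minimal sketch under assumptions: the coordinates are made up for illustration, the comparison is taken as strict ("less than"), and the helper names `location_errors` and `within_tolerance` are hypothetical.

```python
import math

def location_errors(labelled, detected):
    """Euclidean distance in pixels between hand-labelled and detected points."""
    return [math.dist(a, b) for a, b in zip(labelled, detected)]

def within_tolerance(error, eye_distance, fraction=0.10):
    """A detection counts as correct if its error is below a fraction of the
    distance between the eyes (10% in the text)."""
    return error < fraction * eye_distance

# Illustrative values only: two features on a face with 40 pixels between the eyes.
labelled = [(10.0, 10.0), (20.0, 20.0)]
detected = [(13.0, 14.0), (20.0, 21.0)]
errors = location_errors(labelled, detected)
mean_error = sum(errors) / len(errors)
hits = [within_tolerance(e, eye_distance=40.0) for e in errors]
```

Here the first detection is 5 pixels off and fails the 4-pixel tolerance, while the second is 1 pixel off and passes.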

In addition to computing the ROC of the classifiers, we extracted two
measures to ease the comparison between different classifiers: (i) the
maximum detection rate, or the top-1 detection performance, and (ii)
the average detection rate for a given range of false detection rates.
Note that the former is commonly used to measure the performance of
an object detector regardless of the confidence level of the detection.
The latter measure, obtained from the area under the ROC plot on the
region where the system is actually operated, is more useful for selecting
the classifier that is best for the application in question; in our case, the
real-time face and facial feature tracker operates at a 1% false detection
rate.
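
The two summary measures can be sketched as follows. This is an illustration under assumptions rather than the procedure used in the experiments: the classifier responses at feature and background locations are taken as given lists of scores, the ROC is treated as a step function, and the names `roc_points`, `average_detection_rate` and `top1_rate` are hypothetical.

```python
def roc_points(feature_scores, background_scores):
    """(false detection rate, detection rate) pairs, one per threshold,
    starting from the strictest threshold."""
    thresholds = sorted(set(feature_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        detection = sum(s >= t for s in feature_scores) / len(feature_scores)
        false_detection = sum(s >= t for s in background_scores) / len(background_scores)
        points.append((false_detection, detection))
    return points

def average_detection_rate(points, max_false_rate, steps=100):
    """Mean height of the ROC over false detection rates in [0, max_false_rate]."""
    total = 0.0
    for i in range(steps + 1):
        false_rate = max_false_rate * i / steps
        # Best detection rate reachable without exceeding this false rate.
        total += max(d for f, d in points if f <= false_rate)
    return total / (steps + 1)

def top1_rate(hits):
    """Fraction of test images whose response peak lies within tolerance."""
    return sum(hits) / len(hits)

points = roc_points([0.9, 0.3], [0.5, 0.1])
average = average_detection_rate(points, max_false_rate=0.01)
top1 = top1_rate([True, False, True, True])
```

Restricting the average to the operating range, here up to a 1% false detection rate, is what distinguishes this measure from the area under the whole ROC.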

Three sets of experiments were conducted to evaluate the information-
based maximum discrimination learning technique. The first set is used
to compare classifiers trained with different values of the bootstrapping
mixing factor $\beta$. This mixing factor reflects the weight given to the error
bootstrapping of the negative examples. Tables 7.2-7.4 show the results
obtained with each of the feature detectors in terms of the accuracy,
top-1 detection performance and the average detection rate for 1% false
alarm, respectively. Figures 7.5-7.8 show the comparison of the ROCs of
the detectors of four facial features. These four facial features are (i) the
outer corner of the right eye, (ii) the right corner of the right eyebrow,
(iii) the center of the nostrils, and (iv) the right corner of the mouth.

Table 7.2. Facial feature detection accuracy under error bootstrapping of background
examples. [1]: right eye corner, [2]: left eye corner, [3]: right corner of the right
eyebrow, [4]: right corner of the left eyebrow, [5]: left corner of the right eyebrow,
[6]: left corner of the left eyebrow, [7]: nostril center, [8]: right mouth corner, [9]: left
mouth corner.

Table 7.3. Top-1 facial feature detection performance under error bootstrapping of
background examples. [1]-[9]: see descriptions at Table 7.2.

Table 7.4. ROC area of facial feature detection under error bootstrapping of
background examples. [1]-[9]: see descriptions at Table 7.2.

Figure 7.5. Feature detection ROC under error bootstrapping of background
examples: corner of the right eye

The second set of experiments is used to compare classifiers trained
with different values of the bootstrapping mixing factor $\alpha$. The mixing
factor $\alpha$ reflects the weight given to the error bootstrapping of the
positive examples of the features. Tables 7.5-7.7 show the results obtained
with each of the feature detectors in terms of the accuracy, top-1
detection performance and the average detection rate for 1% false alarm,
respectively. Figures 7.9-7.12 show the comparison of the ROCs of the
detectors of the four facial features mentioned above.

Table 7.5. Feature detection accuracy under error bootstrapping of feature examples.
[1]-[9]: see descriptions at Table 7.2.

Table 7.6. Top-1 facial feature detection performance under error bootstrapping of
feature examples. [1]-[9]: see descriptions at Table 7.2.

Table 7.7. ROC area of feature detection under error bootstrapping of facial feature
examples. [1]-[9]: see descriptions at Table 7.2.

Figure 7.6. Feature detection ROC under error bootstrapping of background
examples: right corner of right eyebrow

Figure 7.7. Feature detection ROC under error bootstrapping of background
examples: center of nostrils

The third set of experiments is used to compare classifiers trained
with different values of the weight $\lambda$ in the computation of the
divergence given in Eq. (2.19). Tables 7.8-7.10 show the results obtained
with each of the feature detectors in terms of the accuracy, top-1
detection performance and the average detection rate for 1% false alarm,
respectively. Figures 7.13-7.16 show the comparison of the ROCs of the
detectors of the four facial features mentioned above.








Table 7.8. Feature detection accuracy under divergence weight. [1]-[9]: see
descriptions at Table 7.2.

Table 7.9. Top-1 feature detection performance under divergence weight. [1]-[9]: see
descriptions at Table 7.2.

Table 7.10. ROC area of feature detection under divergence weight. [1]-[9]: see
descriptions at Table 7.2.

Weight   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
-        0.95  0.90  0.87  0.80  0.81  0.86  0.99  0.94  0.83
0.10     0.95  0.91  0.91  0.83  0.82  0.88  0.99  0.94  0.85
0.15     0.96  0.91  0.90  0.83  0.84  0.92  0.99  0.93  0.83
0.20     0.97  0.91  0.88  0.84  0.83  0.88  0.99  0.93  0.83
0.25     0.96  0.91  0.90  0.83  0.86  0.91  0.99  0.93  0.83
0.30     0.95  0.91  0.88  0.83  0.84  0.91  0.99  0.93  0.84
0.35     0.96  0.91  0.85  0.85  0.85  0.90  0.99  0.93  0.83
0.40     0.96  0.91  0.83  0.84  0.83  0.90  0.99  0.94  0.84
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Figure 7.8. Feature detection ROC under error bootstrapping of background exam- 
ples: right corner of mouth 
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Figure 7.9. Feature detection ROC under error bootstrapping of feature examples: 
corner of the right eye 
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Figure 7.10. Feature detection ROC under error bootstrapping of feature examples: 
right corner of right eyebrow 




Figure 7.11. Feature detection ROC under error bootstrapping of feature examples:
center of nostrils

Figure 7.12. Feature detection ROC under error bootstrapping of feature examples:
right corner of mouth

Figure 7.13. Feature detection ROC under divergence weight: corner of the right
eye (horizontal axis: probability of false alarm)

Figure 7.14. Feature detection ROC under divergence weight: right corner of right
eyebrow (horizontal axis: probability of false alarm)

Figure 7.15. Feature detection ROC under divergence weight: center of nostrils
(horizontal axis: probability of false alarm)

Figure 7.16. Feature detection ROC under divergence weight: right corner of mouth
(horizontal axis: probability of false alarm)

From the performance comparisons of these classifiers, several obser- 
vations can be made. First, the nostrils are by far the easiest facial 
feature to detect, followed by the outer corners of the eyes. We take
advantage of this by detecting the facial features in a hierarchical scheme
in which the center of the nostrils is first found with a larger search area, 
and then its position is used to help detect the rest of the facial features. 
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This hierarchical scheme improves the overall performance and speeds 
up the overall facial feature detection by reducing the search areas of 
most of the features. 

From the comparison of the performance under different values of the
mixing factors $\alpha$ and $\beta$ of the error bootstrapping learning step, it is
clear and consistent among all facial features that the bootstrapping step
of the positive examples, i.e., $\alpha > 0$, does not improve the performance,
while bootstrapping negative examples, $\beta > 0$, does show significant
improvement. We believe that this is because the negative examples are
more sparse in the space of images and the mixture of the two probability
models does help in fitting the data. Since the value of $\beta$ that
maximizes this improvement is not consistent among all the classifiers,
an iterative training algorithm with some validation routine is required
to take full advantage of this form of error bootstrapping.

Similarly, from the comparison of the results of the use of different
weighting values $\lambda$ in the computation of the divergence, we found
consistent improvement with $\lambda \in [0.10, 0.30]$. Note that this bias towards
the distribution of the positive examples in the computation of the
divergence from Eq. (2.19) consistently tells us that the distribution of the
negative examples does not fit the training data well, and that other
techniques such as error bootstrapping can be used to improve it.

3.2 Face and facial feature tracking 

We have tested our face and facial feature detection and tracking sys- 
tem with all the video segments in the face video database. The result 
of this system consists of the trajectory of the facial features together 
with the confidence level of the overall tracking and of the individual 
facial features. The confidence level threshold of each facial feature has 
been set to 1% according to the subset of image frames used in the 
previous section. Out of the 107460 image frames of the complete video 
database, in 105417 (98%) frames the face was successfully tracked, with 
large rotation in depth being the most significant cause of errors.
Figures 7.17-7.21 show several examples of these tracking results at selected
frames.
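
As a consistency check on the frame counts, the segment lengths quoted for the face video database (2600, 2020 and 1350 frames) appear to be per-person figures, since multiplying their sum by the 18 people gives the total used here. The short computation below only restates numbers from the text; reading the segment lengths as per-person values is an inference, not something the text states.

```python
frames_per_person = 2600 + 2020 + 1350   # three segments per person
total_frames = 18 * frames_per_person    # 18 people in the database
tracked_fraction = 105417 / total_frames
print(total_frames, round(100 * tracked_fraction, 1))
```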

4. Face and Facial Expression Recognition 

We tested the proposed algorithm for embedded face and facial ex- 
pression recognition using the first set of video segments of our face video 
database. We trained one model for each person using the first two thirds 
of the frames of the first set of video segments in our video database and 
left the rest of the frames for testing. We evaluated our algorithm with 
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Figure 7.17. Examples of the facial feature tracking 



each of the test frames independently. For each test frame, we computed 
the likelihood of the six facial expressions for the 18 people. 

Face and facial expression recognition were carried out using maximum
likelihood decisions, but the latter was tested only in a person-dependent
scenario. We approximate the likelihood of the face from
Eq. (5.2) with the likelihood of the facial expression model that is most
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Figure 7.18. Examples of the facial feature tracking 



likely. The face recognition algorithm selects the person’s model and the 
facial expression that maximize the likelihood of the test image. In a
person-dependent scheme, the expression recognition algorithm simply 
selects the facial expression with maximum likelihood. 
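
The two maximum likelihood decisions can be sketched as follows. This is a minimal illustration that assumes the per-frame log-likelihoods of every person and expression model are already available as a table; how they are computed from the face models is not shown, and the function names and the toy scores are hypothetical.

```python
def recognize_face(log_likelihood):
    """Pick the (person, expression) pair with the highest log-likelihood.
    log_likelihood[person][expression] holds one value per model."""
    best = None
    for person, row in enumerate(log_likelihood):
        for expression, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, person, expression)
    return best[1], best[2]

def recognize_expression(log_likelihood, person):
    """Person-dependent case: the identity is known, so only that person's
    expression models are compared."""
    row = log_likelihood[person]
    return max(range(len(row)), key=row.__getitem__)

# Toy scores for two people and three expressions.
scores = [[-12.0, -9.5, -11.0],
          [-8.0, -10.0, -7.5]]
person, expression = recognize_face(scores)
known_person_expression = recognize_expression(scores, person=0)
```

Taking the maximum over expressions before comparing people corresponds to approximating the likelihood of a face by that of its most likely expression model.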

The face models used for embedded face and facial expression recog- 
nition are based on nine facial features divided into four facial regions. 
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Figure 7.19. Examples of the facial feature tracking 



Figure 5.1 illustrates these regions and facial features. The appearance 
of these facial features consists of an image subwindow around the fea- 
ture position. The subwindow size and position of each facial feature 
were selected by hand. Figure 7.22 shows examples of these feature sub- 
windows for each of the six facial expressions of four subjects of our 
video database. 
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Figure 7.20. Examples of the facial feature tracking 



4.1 Face recognition 

Figures 7.23-7.25 show comparisons of face recognition performances 
obtained using facial feature appearances and positions independently 
and in combination. These ROCs show that the positions of the center 
of the nostrils and the corner of the eyebrows do not improve the perfor- 
mance of their corresponding feature appearances for face recognition. 
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Figure 7.21. Examples of the facial feature tracking 



However, the combination of the appearances and positions of the mouth 
corners did show significant improvement. 

Figure 7.26 shows the improvement obtained in face recognition
performance by combining the position of the mouth corner with the
appearances of the eye and nose region and the mouth region. While most
face recognition systems exclude the mouth region and report it to be
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Figure 7.22. Examples of the facial feature image subwindows 




Figure 7.23. Face recognition performance comparison of the eye and nose region 



a poor performer, our algorithm improves face recognition performance 
using a combination of appearance and geometry of the mouth. The 
mouth appearance is not only normalized by the position of the mouth 
corners, but these positions are also used as part of the similarity mea- 
sure used for face recognition. 

Figure 7.27 shows the face recognition performance of the region of the
eyebrows independently and in combination with the rest of the facial
regions. The combined use of the appearance of all facial feature regions
results in the best face recognition performance. This system achieved a
99% correct recognition rate with less than a $5 \times 10^{-8}$% false
recognition rate on our 18-person video database.

Figure 7.24. Face recognition performance comparison of the eyebrow region

Figure 7.25. Face recognition performance comparison of the mouth region

These experiments and results have shown that detecting and tracking
facial features improves face recognition algorithms. The performance
improvement achieved by our face recognition system is the result of
three fundamental ideas. First, we use facial feature locations to
normalize facial appearance with respect to nonrigid deformation of the
facial expressions. Second, we use the facial expression deformation,
especially that of the mouth, as part of the similarity measure of the
face recognition system. Finally, we incorporate a facial expression
hidden variable in our face models so that face recognition is carried
out independently of facial expressions.

Figure 7.26. Face recognition performance improvement by mouth feature positions

Figure 7.27. Face recognition performance improvement by eyebrow feature
appearances

Table 7.11. Expression recognition performance using only feature positions

            Neutral  Sadness  Happiness  Surprise  Disgust  Anger
Neutral        4457       85         65        55       67     50
Sadness         908      501         29        15       43     45
Happiness       863       41        810        21       37     23
Surprise        626       14         21       612       25     17
Disgust         827       54         41        30      541    107
Anger           695       69         34        25      119    554
Error (%)      46.7     34.4       19.0      19.2     34.9   30.4

Although we do not report results using HMMs to model the temporal
nature of facial expressions, further improvement in face recognition
performance is expected from the constraints imposed by the transition
matrices associated with the facial expressions. HMMs are also useful
in the learning stage when facial expression recognition is not
required. In this case, the expectation maximization (EM) algorithm
can be used for automatic self-clustering of facial expressions, so that
no manual labelling of the training video segments is required.

4.2 Facial expression recognition 

We evaluated the performance of our facial expression recognition
approach using confusion matrices. Figure 7.28 shows several example
frames of our video database and the result of our facial expression
recognition system. The bars indicate the likelihood of each expression
model, and the most likely expression is also shown on the right.
Table 7.11 shows the results obtained in facial expression classification
using only facial feature positions. Table 7.12 shows the classification
performance obtained by using the facial feature positions and
appearances. It is interesting that the nine facial feature positions
alone give reasonably good performance. Including feature appearances
together with the feature positions significantly improves facial
expression recognition: the error rates in Table 7.12 are lower than
those in Table 7.11 for every expression (for example, 6.8% versus 19.0%
for happiness). These results are particularly good considering that for
most of our test subjects, it was difficult to fake the facial expressions
consistently.
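
The Error (%) row of the confusion matrices can be checked with a few lines of code. The sketch below assumes that each error rate is the off-diagonal share of its column total. That normalization is inferred from the numbers and is not stated in the text. With it, the computed values agree with the Error row of Table 7.11 to within about 0.1 percentage point.

```python
LABELS = ["Neutral", "Sadness", "Happiness", "Surprise", "Disgust", "Anger"]

# Counts copied from Table 7.11 (feature positions only), row by row.
TABLE_7_11 = [
    [4457, 85, 65, 55, 67, 50],
    [908, 501, 29, 15, 43, 45],
    [863, 41, 810, 21, 37, 23],
    [626, 14, 21, 612, 25, 17],
    [827, 54, 41, 30, 541, 107],
    [695, 69, 34, 25, 119, 554],
]
REPORTED_ERROR = [46.7, 34.4, 19.0, 19.2, 34.9, 30.4]


def column_error_percent(matrix):
    """Percentage of each column's total that lies off the diagonal.

    Assumption: errors are normalized per column (inferred, not stated).
    """
    errors = []
    for j in range(len(matrix)):
        column_total = sum(row[j] for row in matrix)
        errors.append(100.0 * (column_total - matrix[j][j]) / column_total)
    return errors


computed = column_error_percent(TABLE_7_11)
for label, value, reported in zip(LABELS, computed, REPORTED_ERROR):
    print(f"{label:<10} computed {value:5.2f}  reported {reported:5.1f}")
```

The same function can be applied to the counts of Table 7.12 to compare the two feature sets column by column.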

We also measured the expressiveness of the facial feature regions by
comparing the recognition performance obtained with each of the facial
expressions independently. Figures 7.29-7.34 show these comparisons
using the combination of facial feature appearances and positions.
Figures 7.35-7.40 show these comparisons using only facial feature
positions.

Figure 7.28. Sample frames from our Face Video Database and the results of the
facial expression recognition system.

Table 7.12. Expression recognition performance using feature positions and images

            Neutral  Sadness  Happiness  Surprise  Disgust  Anger
Neutral        7049       64         51        33       60     28
Sadness         337      633          3         6       44     47
Happiness       292       16        932         9        6      3
Surprise        125        2          6       700        3      0
Disgust         301       21          0         5      661     54
Anger           272       28          8         5       58    664
Error (%)      15.8     17.2        6.8       7.6     20.5   16.5

These results are consistent with empirical observations. For example,
Figure 7.30 shows that the eyebrow region is the most useful for
detecting the expression of sadness. Figure 7.36 also shows that the
regions of the mouth and eyebrows are the most useful to detect the
expression of sadness. Similarly, Figure 7.31 indicates that the
expression of happiness (smile) is best detected using the mouth region.

Figure 7.29. Neutral expression recognition performance of facial regions using
feature positions and appearances

Figure 7.30. Sadness expression recognition performance of facial regions using
feature positions and appearances

Figure 7.31. Happiness expression recognition performance of facial regions using
feature positions and appearances

Figure 7.32. Surprise expression recognition performance of facial regions using
feature positions and appearances

4.3 Model-based Video Coding 

Our proposed head and feature tracking system was tested on several
real and synthetic video sequences. The synthetic sequences were created
using a Cyberware Head Scan and a pre-specified motion path, so that
the ground truth for the angles, scale, and translation was known. In
the real sequences the frames were grayscale, 320x240, and were captured
at 30 fps. In Figure 7.41 we see the original synthetic sequence
side-by-side with the texture mapped head model rendered at the computed
pose. Since we know the motion in the synthetic case, we can compare
the recovered angles and scale directly with the ground truth. Figure
7.42(a) shows plots of the recovered angles (optimal, filtered, and true
values). It is also interesting to examine the accuracy of the Kalman
filter predictions for the 3D pose. Figure 7.42(b) shows a plot of the
predicted and actual angles for the Y axis using the Kalman filter model.

Figure 7.33. Disgust expression recognition performance of facial regions using
feature positions and appearances

Figure 7.34. Anger expression recognition performance of facial regions using
feature positions and appearances

Figure 7.35. Neutral expression recognition performance of facial regions using
feature positions alone

Figure 7.36. Sadness expression recognition performance of facial regions using
feature positions alone

Figure 7.37. Happiness expression recognition performance of facial regions using
feature positions alone

Figure 7.38. Surprise expression recognition performance of facial regions using
feature positions alone
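
To illustrate what comparing predicted and actual angles involves, here is a minimal sketch of a Kalman filter for a single pose angle. It assumes a constant-velocity state (angle and angular rate) and a direct measurement of the angle. The state model, the noise values q and r, and the synthetic ramp are illustrative assumptions and not the filter parameters used in our system.

```python
def kalman_step(x, P, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle for state x = [angle, rate], covariance P (2x2).

    Returns the predicted angle (before seeing z) and the updated x and P.
    """
    # Predict with the constant-velocity model F = [[1, dt], [0, 1]].
    angle_pred = x[0] + dt * x[1]
    rate_pred = x[1]
    p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    # Update with the measured angle z (measurement matrix H = [1, 0]).
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain
    innovation = z - angle_pred
    x_new = [angle_pred + k0 * innovation, rate_pred + k1 * innovation]
    P_new = [[(1.0 - k0) * p00, (1.0 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return angle_pred, x_new, P_new


# Synthetic head rotation of 2 degrees per frame, for illustration only.
x = [0.0, 0.0]
P = [[1000.0, 0.0], [0.0, 1000.0]]
predictions = []
for k in range(30):
    true_angle = 2.0 * k
    predicted, x, P = kalman_step(x, P, z=true_angle)
    predictions.append(predicted)
```

Plotting `predictions` against the true angles gives a curve of the same kind as Figure 7.42(b); for this noise-free ramp the prediction locks onto the motion after a few frames.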

For the real video experiments, ground truth values are not known.
However, a visual comparison can be made to see how accurately the
computed angles and scale follow the 3D motion of the head. In Figure
7.46 we see the original sequence side-by-side with the texture mapped
head model rendered at the computed pose. Figure 7.44(a) shows plots
of the recovered angles for one of the test sequences (optimal and
filtered values). Figure 7.44(b) shows plots of the predicted pose and
the actual computed pose angles for the same sequence. Finally, in
Figure 7.43, we can see the 3D wireframe model overlaid on the original
sequence using the computed pose.

Figure 7.39. Disgust expression recognition performance of facial regions using
feature positions alone

Figure 7.40. Anger expression recognition performance of facial regions using
feature positions alone

Figure 7.42. Results of pose recovery for Synthetic Sequence. (a) Optimal, filtered,
and true measurements - Top: θx, Bottom: θy. (b) Predicted and true measurements
for Y angle.

Figure 7.43. 3D Model Wireframe tracking. Wireframe mesh is overlaid on the
original images using the recovered pose.

We can also evaluate the performance of our system as a facial feature
tracker. Figure 7.45 shows the results of tracking features in 2D that
undergo global motion as well as global plus non-rigid local motion.
The rigid points were the outer eye corners, nose base, and middle of
the nose bridge. Non-rigid points were the tips of the eyebrows and the
mouth corners.




Figure 7.44. Results of pose recovery for Real Sequence A. (a) Optimal and filtered
measurements - Top: θx, Bottom: θy. (b) Predicted and Optimal measurements for
Y angle.

Finally, we examine the computation of the 3D motion vectors for
non-rigid points. As explained earlier, these vectors are computed from
the 2D trajectories of non-rigid feature points during the tracking.
Motion is constrained in 3D to lie on the surface of the head model using
a locally planar surface assumption. Figure 6.10 shows the results for
the eyebrow and mouth corners. The recovered 3D vectors have been applied
to the well-known Parke head model [103] to synthesize the non-rigid motion.
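
The locally planar constraint can be sketched as follows. The example assumes an orthographic projection along the z axis, so that the tracked 2D displacement gives the x and y components of the 3D vector directly, and it assumes that the surface normal at the feature point is available from the head model. Both are simplifying assumptions made for illustration; the function name is hypothetical.

```python
def lift_to_tangent_plane(dx, dy, normal):
    """Return (dx, dy, dz) so that the 3D motion is orthogonal to `normal`.

    Assumes orthographic projection along z: the image displacement (dx, dy)
    is kept, and dz is chosen so the vector stays in the local tangent plane.
    """
    nx, ny, nz = normal
    if abs(nz) < 1e-9:
        # Surface seen edge-on: the depth component cannot be recovered.
        raise ValueError("normal has no z component; dz is undetermined")
    dz = -(nx * dx + ny * dy) / nz
    return (dx, dy, dz)


# Hypothetical surface normal near a mouth corner, for illustration only.
normal = (0.0, 0.6, 0.8)
motion = lift_to_tangent_plane(1.0, 2.0, normal)
residual = sum(m * n for m, n in zip(motion, normal))  # should be about zero
```

The tangent-plane condition supplies the one equation needed to recover the depth component that a single 2D trajectory cannot provide.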

Figure 7.45. Feature tracking results, selected frames from Sequence B.

Figure 7.46. Visual comparison of recovered pose, selected frames from Sequence A.
Left: original frame. Right: rendered model.

Figure 7.47. (a) Reference frame from video sequence, (b) Neutral head model,
(c) Expression at non-frontal pose, (d) Synthesized expression



Chapter 8 



APPLICATION IN AN AUDIO-VISUAL 
PERSON RECOGNITION SYSTEM 



The original system in [60] has been successfully ported from the
initial Silicon Graphics Onyx platform with 12 R10000 processors to
a Pentium III 600 MHz processor so that the detection speed can be
evaluated and compared with other systems. For 320 x 240 real-time
video, the system in [60] achieves a detection rate of 5 to 6 frames
per second (fps). That system has been known for its near real-time
performance. Our recent effort speeds it up to 11 to 12 fps
without sacrificing any detection robustness. In comparison,
Viola et al.'s face detection based on AdaBoost achieves a detection rate
of 15 fps for 384 x 288 video on a Pentium III 700 MHz processor [61].

The face and facial features detection algorithms are used as the
kernel of a recently developed multi-modal person identification system
[62]. Only face detection and outer eye corner detection are used for the
face recognition task, since accurate outer eye corner detection enables
the faces to be normalized for recognition. Omitting the detection of
other facial features also makes the system faster. The computer
interface of the multi-modal person recognition system is shown in
Figure 8.1. The upper-left corner displays the real-time video captured
by a digital camcorder mounted on a tripod behind the computer screen.
The upper-center displays text or digits for the user to read for
speaker identification. At the upper-right corner there are three buttons
titled "Start Testing", "Add User" and "Delete User", which indicate three
functionalities. Two bar charts in the lower-left corner display the
face recognition and speaker ID likelihoods, respectively, for each user.
In the lower-center, icon images of the users currently in the database
are shown in black and white, and the recognized person has his/her
image enlarged and shown in color. The lower-right displays all the
names of the users currently in the database. In the status bar at the
bottom of the window, the processing speed in frames per second (fps)
is also shown.

Figure 8.1. The computer interface of the multi-modal person recognition system.

1. Speaker Identification Based on MFCC and 
GMM 

Meanwhile, a speaker identification system has also been developed
that runs in parallel with face recognition. The speaker ID system uses
Mel-frequency cepstrum coefficients (MFCC) [63] as low-level features and
models each user's speech as a Gaussian Mixture Model (GMM) [64].
Maximum likelihood is used to classify a new user's speech, captured
by an on-the-desk microphone, as coming from one of the users in
the database. The likelihoods from the K-nearest neighbor classifier
(face recognition) and the maximum likelihood classifier (speaker ID)
are displayed in the two bar charts in Figure 8.1. The final decision is
made simply by multiplying the two sets of probabilities.
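
The scoring and fusion steps described above can be sketched as follows. The example assumes diagonal-covariance GMMs, a softmax-style conversion of log-likelihoods into probabilities, and a renormalization after the product. These are illustrative choices not specified in the text, and all model parameters, user names and feature frames are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log p(frame) under a diagonal-covariance GMM."""
    total = 0.0
    for frame in frames:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for x, m, v in zip(frame, mu, var):
                log_p -= 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            component_logs.append(log_p)
        peak = max(component_logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total


def to_probabilities(log_scores):
    """Turn per-user log-likelihoods into probabilities that sum to one."""
    peak = max(log_scores.values())
    unnormalized = {u: math.exp(s - peak) for u, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {u: p / total for u, p in unnormalized.items()}


def fuse_by_product(face_probs, speaker_probs):
    """Multiply the two probability sets user by user and renormalize."""
    fused = {u: face_probs[u] * speaker_probs[u] for u in face_probs}
    total = sum(fused.values())
    return {u: p / total for u, p in fused.items()}


# Hypothetical two-user database with 2-D feature frames, for illustration only.
speaker_models = {
    "user_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "user_b": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
frames = [[0.2, -0.1], [0.9, 1.1]]
speaker_probs = to_probabilities(
    {u: gmm_log_likelihood(frames, *model) for u, model in speaker_models.items()}
)
face_probs = {"user_a": 0.6, "user_b": 0.4}  # assumed face classifier output
fused = fuse_by_product(face_probs, speaker_probs)
decision = max(fused, key=fused.get)
```

The renormalization does not change which user scores highest; it only keeps the fused values comparable to the two individual bar charts.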

2. Online Training of both Face Model and
Speaker Model

The system also supports online training. A new user who is not yet
in the database can have his/her model added to the database. After
the user issues the "Add User" command and reads the text prompted
on the screen, 30 seconds of training speech is recorded to train
his/her GMM. MFCC+GMM for speaker ID is not language dependent:
the experimental subject can speak different languages both in training
and testing.

While the user is reading the text or digits, face detection detects 
his/her face and prepares normalized faces for training his/her face 
model. The features of the new user(s) are extracted by projecting 
the faces to the PCA space trained using the original 24 users. The new 
user’s face image is added to the icon image set in the lower-center of 
Figure 8.1. The two bar charts and the list of user names are also up- 
dated to reflect the addition of a new user. After 30 seconds of training, 
the user can ask the system to test whether he/she can be recognized by 
the system. The user can also have his/her model deleted or replaced 
with a new one. 
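
A minimal sketch of the projection step follows. It assumes that a mean face and a set of orthonormal PCA basis vectors have already been computed from the original users. The toy four-pixel images and the function name are hypothetical and serve only to show that adding a user requires no retraining of the basis.

```python
def project_to_pca(face, mean_face, basis):
    """Coefficients of a flattened face image in a fixed PCA basis."""
    centered = [p - m for p, m in zip(face, mean_face)]
    return [sum(c * b for c, b in zip(centered, vector)) for vector in basis]


# Toy 4-pixel "images" and a 2-vector orthonormal basis, for illustration only.
mean_face = [1.0, 1.0, 0.0, 0.0]
basis = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
new_user_features = project_to_pca([3.0, 1.0, 0.5, 0.0], mean_face, basis)
```

Because the basis stays fixed, enrolling a user only requires storing the projected feature vectors of his/her normalized faces.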

3. Experimental Results 

Face detection in real-time video is tested by turning off the face
recognition and speaker ID functionality of the system in Figure 8.1.
Face detection has been tested on 42 subjects in real-time video. Several
video sequences of each subject have been taken on different days
under different lighting conditions. Different sizes of faces have also
been tested. Each sequence is between 1 minute and 5 minutes long.

Face detection results consistently show high detection accuracy
(about 95%) and a very low false alarm rate (< 0.1%). Given the near
real-time face detection, a pattern of face tracking by the green square
and the red crosses can be observed.

Experiments on face detection plus face recognition alone, without
speaker identification, and vice versa have shown nearly perfect
recognition precision. Each of the 24 original subjects whose models are
used in training is correctly recognized using one modality alone. The
combination of the two modalities shows 100% accuracy as a result of the
strong recognition in each individual modality.

The person recognition system recognizes both users who were
already in the database and new users who have been added to the
database by online training with very high accuracy. In one
demonstration, a set of 22 new people were added to the database one by
one and all of them were correctly recognized.




Chapter 9 



CONCLUSIONS 



In this book, we have described a collection of computer vision and 
pattern recognition algorithms for facial analysis with applications to 
human-computer intelligent interface. We have reported solutions from 
face detection in complex backgrounds to face tracking, and from facial 
motion analysis to face and facial expression recognition. We have also 
presented a novel probabilistic framework for embedded face and facial 
expression analysis from video segments. In this book, we have also 
dealt with the issues regarding the implementation and integration of 
these algorithms into a fully automatic, real-time facial analysis system. 

Figure 9.1 (a reiteration of Figure 1.1) illustrates schematically the 
four components of the facial analysis system reported here. First, the 
system has a fast, automatic face and facial feature detection algorithm. 
Next, there is a real-time face tracking module capable of locking onto 
multiple people and tracking nine nonrigid facial features. The third 
system component implements rule-based algorithms for the analysis of 
the global head/face position to detect facial gestures such as head shak- 
ing and nodding. Finally, the last component of our system implements 
an algorithm for embedded face and facial expression recognition. 

While individual components of our facial analysis system find
applications in other fields such as surveillance, video games and
model-based coding, the driving application of our research was
human-computer intelligent interface. The system input consists basically
of real-time video of head-and-shoulder scenes. This kind of video is
obtained, for example, from a camera placed on top of a computer monitor
and aimed at the user of the computer. The output of the system is a
family of high-level events related to the faces of the users in the
field of vision. These vision-based events include the presence,
location, activity, history, gestures, expressions and identity of the
computer's users.

Figure 9.1. Facial analysis for intelligent human-computer interface: an overview

It is the general conclusion of this book that computer systems will 
soon benefit from vision-based algorithms for facial analysis and under- 
standing. Reliable, real-time solutions are becoming readily available 
from the research community. At the same time, computing power is 
constantly increasing, and imaging technologies are being produced mas- 
sively and inexpensively. 

Human-computer interface will have to depart from its current form
in order to accommodate higher-level input events. Current operating
systems and application interfaces rely exclusively on input events from
keyboards, mice or similar mechanical devices, and even speech.
However, facial analysis systems such as that reported in this book
produce high-level events such as "There are two people in the field of
vision. Person 0, of unknown identity, is located at x0. Person 1,
identified as Antonio with confidence level 80%, is located at x1." Even
higher-level events are possible, such as "Person 1 shows disgust with
confidence level 40% and anger with 45%." While interpretation of and
reaction to these sorts of input events are being studied, it is not
clear how high-level machine understanding will improve computing as we
know it today.